Scalable Structure Learning for Probabilistic Soft Logic

07/03/2018
by Varun Embar, et al.
University of California Santa Cruz

Statistical relational frameworks such as Markov logic networks and probabilistic soft logic (PSL) encode model structure with weighted first-order logical clauses. Learning these clauses from data is referred to as structure learning. Structure learning alleviates the manual cost of specifying models. However, this benefit comes with high computational costs; structure learning typically requires an expensive search over the space of clauses which involves repeated optimization of clause weights. In this paper, we propose the first two approaches to structure learning for PSL. We introduce a greedy search-based algorithm and a novel optimization method that trade off scalability and approximations to the structure learning problem in varying ways. The highly scalable optimization method combines data-driven generation of clauses with a piecewise pseudolikelihood (PPLL) objective that learns model structure by optimizing clause weights only once. We compare both methods across five real-world tasks, showing that PPLL achieves an order of magnitude runtime speedup and AUC gains of up to 15% over greedy search.


1 Introduction

Statistical relational learning (SRL) methods combine probabilistic reasoning with knowledge representations that capture the structure in problem domains. Markov logic networks (MLN) [Richardson and Domingos2006] and probabilistic soft logic (PSL) [Bach et al.2017] are notable SRL frameworks that define model structure with weighted first-order logic. However, specifying logical clauses for each problem is laborious and requires domain knowledge. The task of discovering these weighted clauses from data is referred to as structure learning, and has been well-studied for MLNs [Kok and Domingos2005, Kok and Domingos2009, Kok and Domingos2010, Mihalkova and Mooney2007, Biba, Ferilli, and Esposito2008, Huynh and Mooney2008, Khosravi et al.2010, Khot et al.2015]. The extensive related work for MLNs underscores the importance of structure learning for SRL.

Structure learning approaches alleviate the cost of model discovery. However, they face several critical computational challenges. First, even when the model space is restricted to be finite, selecting clauses requires a combinatorial search. Second, heuristic approaches that iteratively refine and grow a set of rules require interleaving several costly rounds of parameter estimation and scoring. Finally, scoring the model often involves computing the model likelihood, which is typically intractable to evaluate exactly.

Structure learning approaches for MLNs vary in the degree to which they address these scalability challenges. An efficient and extensible class of MLN structure learning algorithms adopts a bottom-up strategy, mining patterns and motifs from training data to generate informative clauses [Mihalkova and Mooney2007, Kok and Domingos2009, Kok and Domingos2010]. These data-driven heuristics reduce the search space to useful clauses but still interleave rounds of parameter estimation and scoring, which is expensive for SRL methods.

Motivated by the success of structure learning for MLNs, in this paper, we formalize the structure learning problem for PSL. We extend the data-driven approach to generating clauses and propose two contrasting PSL structure learning methods that differ in scalability and choice of approximations. We build on path-constrained relational random walk methods [Lao and Cohen2010, Gardner et al.2013] to generate clauses that capture patterns in the data. To find the best set of clauses, we introduce a greedy search-based algorithm and an optimization method that uses a piecewise pseudolikelihood (PPLL) objective function. PPLL decomposes the search over clauses into a single optimization over clause weights that is solved with an efficient parallel algorithm. Our proposed PPLL approach addresses the scalability challenges of structure learning and its formulation can be easily extended to other SRL techniques, including MLNs. In this paper, our key technical contributions are to:

  • formulate path-constrained clause generation that efficiently finds relational patterns in the data.

  • propose greedy search and PPLL methods that select the best path-constrained clauses by trading off scalability and approximations for structure learning.

  • validate the predictive performance and runtimes of both methods with real-world tasks in biological paper recommendation, drug interaction prediction and knowledge base completion.

We compare both proposed PSL structure learning methods and show that our novel PPLL method achieves an order of magnitude runtime speedup and AUC improvements of up to 15% over the greedy search method.

2 Background

We briefly review structure learning for statistical relational learning (SRL) and probabilistic soft logic (PSL), the framework for which we propose structure learning approaches.

2.1 Structure Learning for SRL

Our work focuses on SRL methods such as MLNs and PSL that encode dependencies with first-order logic. Below, we formalize the joint distributions defined using logical clauses before outlining structure learning for these methods.

An atom consists of a predicate (e.g. Works, Lives) over constants (e.g. Alice, Bob) or variables (e.g. $A$, $B$). An atom whose predicate arguments are all constants is a ground atom. A literal is an atom or its negation. A clause $C_i$ is a formula $l_1 \vee \dots \vee l_n$ where $l_1, \dots, l_n$ are literals. Given clauses $\mathcal{C} = \{C_1, \dots, C_m\}$ and real-valued weights $\mathbf{w} = \{w_1, \dots, w_m\}$, a model $\mathcal{M} = \{(C_i, w_i)\}_{i=1}^{m}$ is a set of clause and weight pairs.

Given constants from a domain, we substitute the variables appearing in the literals with these constants to obtain a set of ground clauses $G_i$ for each clause $C_i$. The corresponding set of ground atoms is $\mathbf{x} = \{x_1, \dots, x_n\}$, where each $x_j$ is a random variable with assignment $\hat{x}_j$. The model $\mathcal{M}$ defines a distribution over $\mathbf{x}$ as:

$$P(\mathbf{x}) = \frac{1}{Z} \exp\Big( -\sum_{i=1}^{m} w_i \sum_{c \in G_i} \phi_c(\mathbf{x}) \Big) \qquad (1)$$

Each potential $\phi_c$ instantiated from a clause $C_i$ is a function over assignments to $\mathbf{x}$ that returns 0 if the ground clause $c$ is satisfied by those values and 1 otherwise. Intuitively, assignments that satisfy more ground rules are exponentially more probable.
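To make Equation 1 concrete, here is a minimal Python sketch, assuming Boolean ground atoms, a single hypothetical clause over the running citation example of Section 4, and a made-up weight of 2.0; the atom names and helper functions are illustrative, not part of any PSL API.

import math

def unnormalized_log_prob(weights, groundings, assignment):
    """weights[i] pairs with groundings[i], a list of penalty functions
    phi_c(assignment) in {0, 1} (1 = ground clause violated)."""
    return -sum(
        w * sum(phi(assignment) for phi in group)
        for w, group in zip(weights, groundings)
    )

# Hypothetical clause Cites(A,B) & Mentions(B,G) -> Mentions(A,G) with one grounding;
# phi is 1 exactly when the body holds but the head does not.
def phi_1(x):
    return int(x["Cites(P1,P2)"] and x["Mentions(P2,G)"] and not x["Mentions(P1,G)"])

weights = [2.0]
groundings = [[phi_1]]
x_sat = {"Cites(P1,P2)": 1, "Mentions(P2,G)": 1, "Mentions(P1,G)": 1}
x_vio = dict(x_sat, **{"Mentions(P1,G)": 0})

# The satisfying assignment is exp(2.0) times more probable than the violating one.
ratio = math.exp(unnormalized_log_prob(weights, groundings, x_sat)
                 - unnormalized_log_prob(weights, groundings, x_vio))
print(round(ratio, 2))  # 7.39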

The problem of structure learning finds the model $\mathcal{M}^*$ which best fits a set of observed assignments $\hat{\mathbf{x}}$, regularized by model complexity. We denote the set of possible clauses as the language $\mathcal{L}$. Although $\mathcal{L}$ can be infinite, it is standard to impose restrictions that make $\mathcal{L}$ finite for structure learning. Formally, the structure learning problem finds clauses $\mathcal{C}^* \subseteq \mathcal{L}$ and weights $\mathbf{w}^*$ that maximize a regularized log likelihood function given observed assignments:

$$\mathcal{C}^*, \mathbf{w}^* = \operatorname*{arg\,max}_{\mathcal{C} \subseteq \mathcal{L},\; \mathbf{w}} \; \log P(\hat{\mathbf{x}} \mid \mathcal{C}, \mathbf{w}) + \log P(\mathcal{C}, \mathbf{w}) \qquad (2)$$

where $P(\mathcal{C}, \mathbf{w})$ represents priors on the weights and structure. Typical choices for the prior combine a Gaussian prior on weights and an exponential prior on clause length.

The log likelihood requires an exponential sum to compute, and the optimization combines a combinatorial search over $\mathcal{L}$ with a maximization of continuous weights (called weight learning). Consequently, solving structure learning requires further approximations to search and scoring. Approaches to structure learning broadly interleave two key components: clause generation and model evaluation, or scoring. The clause generation phase produces a candidate language over which to search. In practice, this candidate language is a subset of all possible clauses, chosen to restrict the search to useful regions of the space. Model evaluation typically iteratively refines the existing model by learning and scoring candidate clauses in $\mathcal{L}$ using approximations to the regularized log likelihood.

2.2 Probabilistic Soft Logic

Probabilistic soft logic (PSL) is an SRL framework that defines hinge-loss Markov random fields (HL-MRFs), a special class of the undirected graphical model given by Equation 1. HL-MRFs are conditional distributions over real-valued atom assignments in $[0, 1]$ and apply a continuous relaxation of Boolean logic to the ground clauses to derive hinge-loss potentials of the form:

$$\phi_c(\mathbf{y}, \mathbf{x}) = \max\Big( 1 - \sum_{i \in I_c^+} v_i - \sum_{i \in I_c^-} (1 - v_i),\; 0 \Big)^{p} \qquad (3)$$

where $I_c^+$ and $I_c^-$ denote the sets of non-negated and negated ground atoms in the clause, $v_i$ is the soft truth value of atom $i$, and $p \in \{1, 2\}$. In contrast to ground Boolean clauses that are satisfied or violated, a ground clause in soft logic returns a continuous distance to satisfaction. Intuitively, $\phi_c$ corresponds to a linear or quadratic penalty for violating clause $c$.
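As an illustration of Equation 3 (a rough sketch, not the PSL implementation), the following Python function computes the distance to satisfaction of a single ground clause from the soft truth values of its non-negated and negated atoms; the example values are made up.

def hinge_loss_potential(pos, neg, p=1):
    """pos, neg: soft truth values in [0, 1] of the non-negated and negated
    ground atoms of a disjunctive clause; p selects a linear or squared penalty."""
    distance = 1.0 - sum(pos) - sum(1.0 - v for v in neg)
    return max(distance, 0.0) ** p

# Ground clause !Cites(P1,P2) v !Mentions(P2,G) v Mentions(P1,G):
# the head atom is non-negated, the two body atoms are negated.
print(round(hinge_loss_potential(pos=[0.3], neg=[0.9, 0.8]), 4))       # 0.4 (linear)
print(round(hinge_loss_potential(pos=[0.3], neg=[0.9, 0.8], p=2), 4))  # 0.16 (quadratic)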

PSL defines distributions over the target variables for a particular task conditioned on the remaining evidence variables. Formally, given a set of target predicates $\mathcal{P}_t$, a PSL model consists of non-negative weights $\mathbf{w}$ and disjunctive clauses $\mathcal{C}$ where the predicate of at least one literal in each clause belongs to $\mathcal{P}_t$. Given a set of target atoms $\mathbf{y} = \{y_1, \dots, y_n\}$ where each $y_a \in [0, 1]$ is a random variable, and a set of evidence atoms $\mathbf{x}$ where each $x_j$ is an observed variable, a PSL model defines an HL-MRF distribution of the form:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{w}, \mathbf{x})} \exp\Big( -\sum_{i=1}^{m} w_i \sum_{c \in G_i} \phi_c(\mathbf{y}, \mathbf{x}) \Big) \qquad (4)$$

PSL has been successfully applied to many problems including natural language processing [Beltagy, Erk, and Mooney2014], social media analysis [Johnson and Goldwasser2016, Ebrahimi, Dou, and Lowd2016] and information extraction [Platanios et al.2017].

3 Structure Learning for PSL

Given target predicates $\mathcal{P}_t$, structure learning for PSL finds a model $\mathcal{M}^*$ to infer the target atoms $\mathbf{y}$. We denote the language space for PSL as $\mathcal{L}$, which is restricted to disjunctive clauses of the form described above. We again constrain $\mathcal{L}$ to be finite. To overcome the intractable likelihood score, pseudolikelihood [Besag1975] is an approximation that is commonly used across SRL structure learning and weight learning methods. For HL-MRFs, the pseudolikelihood approximates the likelihood as:

$$P^*(\mathbf{y} \mid \mathbf{x}) = \prod_{y_a \in \mathbf{y}} \frac{1}{Z_a} \exp\Big( -\sum_{i=1}^{m} w_i \sum_{c \in G_i(y_a)} \phi_c(\mathbf{y}, \mathbf{x}) \Big) \qquad (5)$$

where $Z_a$ normalizes over the single variable $y_a$. The notation $G_i(y_a)$ selects the ground clauses of $C_i$ in which $y_a$ appears.
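Each per-atom term in Equation 5 involves only a one-dimensional normalizer, which the Python sketch below approximates numerically; it assumes each ground clause containing the atom exposes a callable giving its hinge-loss penalty as a function of that atom's value alone (all other atoms fixed at their observed values). The names and example penalty are illustrative.

import numpy as np

def log_pseudolikelihood_term(y_a_observed, weighted_penalties, grid=1000):
    """weighted_penalties: list of (w_i, penalty_fn) pairs, one per ground
    clause containing atom y_a. Returns the log conditional density of y_a."""
    ys = np.linspace(0.0, 1.0, grid)

    def energy(y):
        return sum(w * pen(y) for w, pen in weighted_penalties)

    # One-dimensional normalizer Z_a over [0, 1], approximated by a grid average.
    Z_a = np.exp([-energy(y) for y in ys]).mean()
    return -energy(y_a_observed) - np.log(Z_a)

# Example: a single squared hinge penalty with weight 2 pulling y_a toward 0.8.
print(log_pseudolikelihood_term(0.75, [(2.0, lambda y: max(0.8 - y, 0.0) ** 2)]))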

Given target predicates $\mathcal{P}_t$ and real-valued variable assignments $\hat{\mathbf{y}}$ and $\hat{\mathbf{x}}$, where each target atom's predicate belongs to $\mathcal{P}_t$, following the objective in Equation 2, structure learning for PSL maximizes the log pseudolikelihood:

$$\mathcal{C}^*, \mathbf{w}^* = \operatorname*{arg\,max}_{\mathcal{C} \subseteq \mathcal{L},\; \mathbf{w} \geq 0} \; \log P^*(\hat{\mathbf{y}} \mid \hat{\mathbf{x}}; \mathcal{C}, \mathbf{w}) \qquad (6)$$

where the ground clauses $G_i$ are those that can be instantiated from the clauses $C_i \in \mathcal{C}$. In the next section, we propose two approaches to the structure learning problem for HL-MRFs that rely on an efficient clause generation algorithm.

4 Approaches to PSL Structure Learning

To formulate PSL structure learning algorithms, we introduce approaches for both key method components: clause generation and model evaluation. We outline an efficient algorithm for data-driven clause generation. For model evaluation over these clauses, we first propose a straightforward greedy local search algorithm (GLS). To improve upon the computationally expensive search-based approach, we introduce a novel optimization approach, piecewise pseudo-likelihood (PPLL). PPLL unifies the efficient clause generation with a surrogate convex objective that can be optimized exactly and in parallel.

4.1 Path-Constrained Clause Generation

The clause generation phase of structure learning outputs the language of first-order logic clauses over which to search. Driven by relational random walk methods used for information retrieval tasks [Lao, Mitchell, and Cohen2011, Gardner et al.2013], we formulate a special class of path-constrained clauses that capture relational patterns in the data. Path-constrained clause generation is also related to the pre-processing steps in bottom-up structure learning methods [Mihalkova and Mooney2007, Kok and Domingos2009, Kok and Domingos2010]. Bottom-up methods typically use relational paths as heuristics to cluster predicates into templates and enumerate all clauses that contain predicate literals from the same template. The structure learning algorithm greedily selects from these clauses. Path-constrained clause generation also produces the candidate language prior to structure learning. Here, we use a breadth-first traversal algorithm which directly generates informative path-constrained clauses by variablizing relational paths in the data.

The inputs to path-constrained clause generation are the ground atoms of a domain, the set of all predicates, and a target predicate $P_t \in \mathcal{P}_t$. In this work, we consider predicates with arity of two, but our approach can be extended to support predicates with arity three and higher. We begin with a running example that illustrates the definitions below. Consider a ground atom set containing Cites(Paper1, Paper2), Mentions(Paper2, Gene) and Mentions(Paper1, Gene), where Mentions is the target predicate. In this simple example, all ground atoms have an assignment of 1. In general, real-valued assignments to atoms must be rounded to 0 or 1 during path-constrained clause generation.

A target relational path for $P_t$, denoted $\pi$, is defined by an ordered list of ground atoms $[a_1, \dots, a_n]$ such that for each consecutive pair of non-target atoms, the last argument of $a_j$ is the first argument of $a_{j+1}$, and the final atom $a_n$ is a target atom connecting the first argument of $a_1$ to the last argument of $a_{n-1}$. Given a target relational path $\pi$, the corresponding first-order path-constrained clause has the form $r_1(V_1, V_2) \wedge \dots \wedge r_{n-1}(V_{n-1}, V_n) \rightarrow P_t(V_1, V_n)$, where each $V_j$ is a logical variable and the $j$-th literal in the clause variablizes the $j$-th atom in $\pi$. The negation of this clause is the clause with the target predicate literal $P_t(V_1, V_n)$ negated.

For the running example, given target relational path [Cites(Paper1, Paper2), Mentions(Paper2, Gene), Mentions(Paper1, Gene)], we obtain the first-order path-constrained clause:

$$\text{Cites}(P, P') \wedge \text{Mentions}(P', G) \rightarrow \text{Mentions}(P, G)$$

We generate the set of all possible path-constrained clauses up to length $k$ by performing breadth-first search (BFS) up to depth $k$ from the first argument of each target atom. A connected BFS search tree for a training example $P_t(e, e')$ is rooted at $e$ and one of its leaf nodes must be $e'$. Every non-leaf constant in the tree has child entities connected to it by ground atoms. For the running example, the connected BFS search tree of depth 2 for target atom Mentions(Paper1, Gene) is rooted at Paper1 with child Paper2, connected by Cites(Paper1, Paper2), and with leaf Gene connected to Paper2 by Mentions(Paper2, Gene).

Given a tree $T$, each path from its root to a leaf node yields a target relational path $\pi$. For target predicate $P_t$, we build the set of connected BFS search trees corresponding to all target atoms. For each tree, we enumerate all such paths and obtain the unique set of paths $\Pi$. For each $\pi \in \Pi$, we form the corresponding path-constrained clause and its negation to obtain all such clauses $\mathcal{L}_\Pi$. Moreover, we can further restrict $\mathcal{L}_\Pi$ to those clauses that connect target atoms, preferring clauses that cover, or explain, a minimum number of training examples. The language defined by $\mathcal{L}_\Pi$ guides the search over models that capture informative relational patterns in the data. Although this procedure produces only Horn clauses and is thus a subset of the full language [Kazemi and Poole2018], it has been successfully used in several relational learning tasks [Lao and Cohen2010, Gardner et al.2013]. While our path-constrained clause generation performs well in the tasks we study, where needed, we will explore more expressive strategies. The Python sketch below illustrates the procedure on the running example.
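The sketch is a simplification, assuming ground atoms stored as (predicate, subject, object) triples, no inverse edges, and no coverage threshold; it is not the PRA-based implementation used in the experiments.

from collections import defaultdict

def generate_path_clauses(atoms, target_pred, max_depth=3):
    """Enumerate first-order path-constrained clauses for target_pred by a
    breadth-first traversal from the first argument of each target atom."""
    out_edges = defaultdict(list)               # subject -> [(predicate, object)]
    for pred, subj, obj in atoms:
        out_edges[subj].append((pred, obj))

    paths = set()
    targets = [(s, o) for p, s, o in atoms if p == target_pred]
    for start, end in targets:
        frontier = [(start, ())]                # (current entity, predicate path)
        for _ in range(max_depth):
            next_frontier = []
            for entity, path in frontier:
                for pred, nxt in out_edges[entity]:
                    new_path = path + (pred,)
                    # Keep paths that reconnect to the target's second argument,
                    # skipping the trivial path that is the target atom itself.
                    if nxt == end and new_path != (target_pred,):
                        paths.add(new_path)
                    next_frontier.append((nxt, new_path))
            frontier = next_frontier

    # Variablize each relational path into a Horn clause and its negation.
    rules = []
    for path in sorted(paths):
        body = " & ".join(f"{p}(V{i}, V{i + 1})" for i, p in enumerate(path))
        head = f"{target_pred}(V0, V{len(path)})"
        rules.extend([f"{body} -> {head}", f"{body} -> !{head}"])
    return rules

atoms = [("Cites", "Paper1", "Paper2"),
         ("Mentions", "Paper2", "Gene"),
         ("Mentions", "Paper1", "Gene")]
for rule in generate_path_clauses(atoms, "Mentions", max_depth=2):
    print(rule)   # Cites(V0, V1) & Mentions(V1, V2) -> Mentions(V0, V2), and its negation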

4.2 Greedy Local Search

Given path-constrained clauses, exactly maximizing the pseudolikelihood objective given by Equation 5 requires evaluating all $2^{|\mathcal{L}_\Pi|}$ subsets of clauses, which is infeasible even with only 100 clauses. Instead, we propose an approximate greedy search algorithm that selects locally optimal clauses in each iteration to maximize pseudolikelihood.

Input: $\Lambda$: path-constrained clauses; $\epsilon$: tolerance; $T$: max iterations
Output: $\mathcal{C}^*, \mathbf{w}^*$: optimal clauses and weights
$\mathcal{C}^* \leftarrow \emptyset$;  $S \leftarrow$ score of the empty model;  $\Delta \leftarrow \infty$;  $t \leftarrow 0$
while $\Delta > \epsilon$ and $t < T$ do
     $S_{best} \leftarrow S$;  $C_{best} \leftarrow$ none
     for each $C \in \Lambda \setminus \mathcal{C}^*$ do
          $\mathbf{w} \leftarrow$ WeightLearn($\mathcal{C}^* \cup \{C\}$)
          $S_C \leftarrow \log P^*(\hat{\mathbf{y}} \mid \hat{\mathbf{x}};\, \mathcal{C}^* \cup \{C\}, \mathbf{w})$
          if $S_C > S_{best}$ then
               $S_{best} \leftarrow S_C$;  $C_{best} \leftarrow C$;  $\mathbf{w}^* \leftarrow \mathbf{w}$
     if $C_{best} \neq$ none then
          $\mathcal{C}^* \leftarrow \mathcal{C}^* \cup \{C_{best}\}$
     $\Delta \leftarrow S_{best} - S$;  $S \leftarrow S_{best}$;  $t \leftarrow t + 1$
return $\mathcal{C}^*, \mathbf{w}^*$
Algorithm 1 Greedy Local Search (GLS)

Algorithm 1 gives the pseudocode for greedy local search (GLS), which approximately maximizes the pseudolikelihood score. GLS iteratively picks the clause that most improves the score and adds it to the model until the score improves by less than $\epsilon$ or a maximum number of iterations has been reached. While GLS is straightforward to implement, it requires $O(T \cdot |\Lambda|)$ rounds of weight learning and scoring, where $|\Lambda|$ denotes the number of candidate clauses and $T$ the maximum number of iterations. As $|\Lambda|$ grows, GLS becomes prohibitively expensive unless we sacrifice performance by increasing $\epsilon$ or decreasing $T$. To overcome the scalability pitfalls of GLS and search-based methods at large, we introduce a new structure learning objective that can be optimized efficiently and exactly.

4.3 Piecewise Pseudolikelihood

The partition function in pseudolikelihood involves an integration that couples all model clauses. Optimizing pseudolikelihood therefore requires evaluating all $2^{|\mathcal{L}|}$ subsets of the language $\mathcal{L}$, necessitating greedy approximations to the combinatorial problem. To overcome this computational bottleneck, we propose a new, efficient-to-optimize objective function called piecewise pseudolikelihood (PPLL). Below, we derive two key results which have significant consequences for the scalability of structure learning: 1) with PPLL, structure learning is solved by performing weight learning once; and 2) the factorization used by PPLL admits an inherently parallelizable gradient-based algorithm for optimization.

PPLL was first proposed for weight learning in conditional random fields (CRFs) [Sutton and McCallum2007]. For HL-MRFs, PPLL factorizes the joint conditional distribution along both random variables and clauses and is defined as:

$$P_{PPLL}(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{m} \prod_{y_a \in \mathbf{y}} \frac{1}{Z_{i,a}} \exp\Big( -w_i \sum_{c \in G_i(y_a)} \phi_c(\mathbf{y}, \mathbf{x}) \Big) \qquad (7)$$

The key advantage of PPLL over pseudolikelihood arises from the factorization of the normalizing constant into local terms $Z_{i,a}$, each of which requires only clause $C_i$ and variable $y_a$ for its computation.

Following standard convention for structure learning, we optimize the log of PPLL, denoted $\log P_{PPLL}$. We highlight a connection between PPLL and pseudolikelihood that is useful in deriving the two key scalability results of PPLL. The product of terms in PPLL corresponding to clause $C_i$ is the pseudolikelihood of the model containing only clause $C_i$. We denote its log $\log P^*_i$:

$$\log P^*_i(\mathbf{y} \mid \mathbf{x}) = \sum_{y_a \in \mathbf{y}} \Big( -w_i \sum_{c \in G_i(y_a)} \phi_c(\mathbf{y}, \mathbf{x}) - \log Z_{i,a} \Big) \qquad (8)$$

We now show that for the log PPLL objective function, performing weight learning on the model containing all clauses in $\mathcal{L}$ is equivalent to optimizing the objective function over the space of all models. Formally:

$$\max_{\mathcal{C} \subseteq \mathcal{L},\; \mathbf{w}} \log P_{PPLL}(\mathbf{y} \mid \mathbf{x}; \mathcal{C}, \mathbf{w}) = \max_{\mathbf{w}} \log P_{PPLL}(\mathbf{y} \mid \mathbf{x}; \mathcal{L}, \mathbf{w}) \qquad (9)$$

Optimizing $\log P_{PPLL}$ over the set of weights $\mathbf{w}$ is equivalent to optimizing over each $w_i$ separately.

Proof.

Each $\log P^*_i$ is a function of only $w_i$. By definition of $P_{PPLL}$, we have $\log P_{PPLL}(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{m} \log P^*_i(\mathbf{y} \mid \mathbf{x})$, so maximizing the sum over $\mathbf{w}$ decomposes into maximizing each term over its own $w_i$. ∎

For PPLL, maximizing over the weights of the model containing all clauses in $\mathcal{L}$ is equivalent to optimizing the structure learning objective.

Proof.

Excluding a clause $C_i$ from the model removes its factors from Equation 7, which has the same effect as setting $w_i = 0$, since at $w_i = 0$ each of its factors is uniform over $[0, 1]$ and contributes zero to the log objective. Because each $w_i$ is optimized independently and $\max_{w_i \geq 0} \log P^*_i \geq \log P^*_i(w_i = 0)$, including every clause of $\mathcal{L}$ at its optimal weight attains the maximum in Equation 9. ∎

As a result of this theorem, instead of a combinatorial search, we perform a simpler continuous optimization over weights that can be solved efficiently. Since the objective is convex and the weights are non-negative, we optimize the above objective using projected gradient descent.

The projected gradient descent algorithm for optimizing the objective function is shown in Algorithm 2. The partial derivative of $\log P_{PPLL}$ with respect to a given weight $w_i$ is of the form:

$$\frac{\partial \log P_{PPLL}}{\partial w_i} = \sum_{y_a \in \mathbf{y}} \sum_{c \in G_i(y_a)} \Big( \mathbb{E}_{y_a}\big[ \phi_c(\mathbf{y}, \mathbf{x}) \big] - \phi_c(\hat{\mathbf{y}}, \hat{\mathbf{x}}) \Big) \qquad (10)$$

The gradient for any weight $w_i$ is the difference between the observed and expected penalties summed over the corresponding ground clauses $G_i(y_a)$. For both pseudolikelihood and PPLL, we can compute the observed penalties once and cache their values, but the repeated expected value computations, even for a one-dimensional integral, remain costly. However, unlike the gradients for pseudolikelihood, each expectation term in the PPLL gradient considers a single clause. Thus, when evaluating gradients for weight updates in Algorithm 2, we use multi-threading to compute the expectation terms in parallel. The dual advantages of parallelizing and requiring weight learning only once make PPLL highly scalable. After convergence of the gradient descent procedure, we return the set of clauses with non-zero weights as the final model.

Input: $\Lambda$: path-constrained clauses; $\epsilon$: tolerance; $T$: max iterations; $\eta$: step size
Output: $\mathcal{C}^*, \mathbf{w}^*$: optimal clauses and weights
for each $C_i \in \Lambda$ do
     $w_i \leftarrow$ initial weight
$t \leftarrow 0$
while $\lVert \mathbf{w} - \mathbf{w}_{prev} \rVert > \epsilon$ and $t < T$ do
     $\mathbf{w}_{prev} \leftarrow \mathbf{w}$
     for each $C_i \in \Lambda$, in parallel, do
          $g_i \leftarrow \partial \log P_{PPLL} / \partial w_i$   (Equation 10)
          $w_i \leftarrow w_i + \eta\, g_i$
          if $w_i < 0$ then
               $w_i \leftarrow 0$   (project onto the non-negative orthant)
     $t \leftarrow t + 1$
$\mathcal{C}^* \leftarrow \emptyset$
for each $C_i \in \Lambda$ do
     if $w_i > 0$ then
          $\mathcal{C}^* \leftarrow \mathcal{C}^* \cup \{(C_i, w_i)\}$
return $\mathcal{C}^*$
Algorithm 2 Piecewise Pseudolikelihood (PPLL)
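To make the parallel weight updates concrete, here is a schematic Python sketch of one PPLL gradient step (not the multi-threaded PSL implementation); it assumes a drastically simplified setting in which each ground clause's penalty depends on a single atom, with illustrative weights and penalty functions.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def clause_gradient(w_i, clause_terms, grid=500):
    """Expected minus observed penalty for one clause's weight (Equation 10).
    clause_terms: list of (observed_penalty, penalty_fn) pairs, one per ground
    clause, where penalty_fn(y) gives the hinge loss as the atom varies over [0, 1]."""
    ys = np.linspace(0.0, 1.0, grid)
    grad = 0.0
    for observed, penalty_fn in clause_terms:
        pen = np.array([penalty_fn(y) for y in ys])
        density = np.exp(-w_i * pen)                # unnormalized 1-D conditional
        expected = (pen * density).sum() / density.sum()
        grad += expected - observed
    return grad

def ppll_gradient_step(weights, terms, step=0.1):
    # Each gradient touches only its own clause's weight, so the terms can be
    # evaluated independently on separate threads.
    with ThreadPoolExecutor() as pool:
        grads = list(pool.map(clause_gradient, weights, terms))
    # Projected gradient ascent: clip weights to stay non-negative.
    return [max(w + step * g, 0.0) for w, g in zip(weights, grads)]

weights = [1.0, 0.5]
terms = [[(0.2, lambda y: max(0.6 - y, 0.0))],        # clause 1, one grounding
         [(0.0, lambda y: max(y - 0.4, 0.0) ** 2)]]   # clause 2, one grounding
print(ppll_gradient_step(weights, terms))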

5 Experimental Evaluation

Method   Fly-Gene      Yeast-Gene     DDI-Interacts   Freebase-FilmRating   Freebase-BookAuthor
GLS      0.95 ± 0.01   0.86 ± 0.02    0.66 ± 0.01     0.65 ± 0.04           0.67 ± 0.03
PPLL     0.97 ± 0.002  0.90 ± 0.003   0.76 ± 0.01     0.65 ± 0.05           0.65 ± 0.04
Table 1: Average AUC of methods across five prediction tasks. Bolded numbers are statistically significant. We show that PPLL training improves over GLS in three out of five settings.
Figure 1: Running times (in seconds) in log scale on Freebase tasks. PPLL consistently scales more effectively than GLS.

The PPLL optimization method uses a fully factorized approximation for scalability while GLS greedily maximizes the less decoupled pseudolikelihood at the expense of speed. We explore the trade-offs made by these two methods by evaluating predictive performance and scalability. We investigate these experimental questions with five prediction tasks and compare PPLL against GLS after generating path-constrained clauses. The evaluation tasks include paper recommendation in biological citation networks, drug interaction prediction and knowledge base completion.

5.1 Datasets

For our datasets, we obtain citation networks for biological publications, drug-drug interaction pharmacological networks and knowledge graphs.

Biological Citation Networks

Our first dataset consists of biology-related papers and entities such as authors, venues, words, genes, proteins and chemical compounds [Lao, Mitchell, and Cohen2011]. The dataset includes relations over these entity types for two domains, “Fly” and “Yeast”, resulting in two citation networks. The prediction target is the Gene relation between genes and papers that mention them. To enforce training only on papers from the past, we partition papers into periods of time, using those from 2006 as observations, training on papers from 2007 and evaluating on papers from 2008. We randomly subsample targets to obtain 1500 train and test links, and generate five such random splits for cross-validation.

Drug-drug interaction

The second dataset we use includes chemical interactions between drug pairs, called drug-drug interactions (DDI), across 196 drug compounds obtained from the DrugBank database. This dataset also contains a directed graph of relations from DrugBank between these drugs and gene targets, enzymes, and transporters. Our target for prediction is the Interacts relation between drugs. We subsample the tens of thousands of labeled interactions and shuffle the remaining labeled DDI links into five folds for cross-validation. Each fold contains almost 2000 labeled DDI targets. We alternate using one fold of DDI edges as observations, one for training and one for held-out evaluation.

Freebase

Our third dataset comes from the Freebase knowledge graph and is widely used for validating knowledge base (KB) completion tasks [Gardner et al.2014]. We study KB completion for two relations: links between films and their ratings (FilmRating) and links between authors and books written (BookAuthor). The remaining relations in the KB are observed. For both target relations, we subsample edges and split the resultant edges into five folds for cross-validation, yielding 1000 labeled edges per fold.

5.2 Experimental Setup

Our first experimental question evaluates predictive performance using area under the ROC curve (AUC) on held-out data with five-fold cross-validation across the five tasks described above. Our second question validates scalability by comparing running times for both methods as the number of clauses grows. For both methods, we use ADMM inference implemented in the probabilistic soft logic (PSL) framework [Bach et al.2017]. For GLS, we use the pseudolikelihood learning algorithm in PSL and implement its corresponding scoring function within PSL (psl.linqs.org). For PPLL, we implement the parallelized learning algorithm in PSL. For all tasks, we enumerate target relational paths using the BFS utility in the Path Ranking Algorithm (PRA) framework (github.com/matt-gardner/pra) [Lao and Cohen2010, Gardner et al.2013, Gardner et al.2014] and generate path-constrained clauses from these paths. PRA generates and includes the inverses of all atoms when performing BFS. To form clause literals from these inverses, we use the original predicate and reverse the order of its variablized arguments, as sketched below.
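A tiny Python sketch of that rewriting; the leading-underscore marker for inverse edges is an assumption for illustration, not PRA's actual convention.

def literal_from_edge(predicate, arg1, arg2):
    # Hypothetical convention: a leading underscore marks an inverse edge.
    if predicate.startswith("_"):
        return f"{predicate[1:]}({arg2}, {arg1})"   # original predicate, arguments swapped
    return f"{predicate}({arg1}, {arg2})"

print(literal_from_edge("_Cites", "V0", "V1"))   # Cites(V1, V0)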

As the number of generated clauses grows, GLS becomes prohibitively expensive, as we show in our scalability results, which necessitates a clause-pruning strategy. We prune the set of clauses by retaining those that connect at least 10 target atoms and selecting the top 50 clauses by number of targets connected. For each target predicate in the prediction tasks detailed above, we also add a negative prior clause to the candidate clauses. For link prediction tasks, the negative prior captures the intuition that true positive links are rare and most links do not form. We refer the reader to [Bach et al.2017] for a detailed discussion of the importance of negative priors. For the biological citation networks and Freebase settings, we subsample negative examples of the targets to mitigate the imbalance in labeled training data. We perform 150 iterations of gradient descent for PPLL and 15 iterations for GLS, since each GLS iteration requires several rounds of weight learning.

5.3 Predictive Performance

Our first experimental question investigates the ramifications for predictive performance of the approximations made by each method. PPLL approximates the likelihood by fully factorizing across clauses and target variables while GLS uses the pseudolikelihood approximation which still couples clauses. We examine whether the decoupling in PPLL limits its predictive performance. We generate path-constrained clauses as input to both methods and evaluate their performance on held-out data. Table 1 compares both methods using AUC for all five prediction tasks averaged across multiple folds and splits.

Table 1 shows that PPLL gains significantly in AUC over GLS in three out of five settings. For the Gene link prediction task in the Yeast and Fly biological citation networks, PPLL also yields lower variance given the same rules. In the DDI setting, where we predict Interacts links between drugs, PPLL enjoys a 15% AUC gain over GLS, from 0.66 to 0.76. In the Freebase setting, for the BookAuthor task, PPLL again achieves comparable performance with GLS. GLS only improves slightly over the PPLL approximation in one setting, predicting FilmRating, with a statistically insignificant gain of 0.02 in AUC.

5.4 Scalability Study

Our second experimental question focuses on the scalability trade-offs made by GLS and PPLL. PPLL requires a single round of weight learning over all candidate clauses, made faster with parallelized updates, while GLS requires iterative rounds of weight learning and model evaluation. We select the two Freebase tasks, BookAuthor and FilmRating, where path-constrained clause generation initially yielded several hundred rules. We plot the running time for both methods as the size of the candidate clause set increases from 25 to 200.

Figure 1 shows the running times (in seconds) for both methods plotted in log scale across the two Freebase tasks as the number of clauses to evaluate increases. The results show that while PPLL remains computationally feasible as the number of clauses increases, GLS quickly becomes intractable as the clause set grows. Indeed, for BookAuthor, GLS requires almost two days to learn a model with 200 candidate clauses. In contrast, PPLL completes in four minutes using 200 clauses in the same setting. PPLL overcomes the requirement of interleaving weight learning and scoring while also admitting parallel weight learning updates, boosting scalability. The results suggest that PPLL can explore a larger space of models in significantly less time.

6 Related Work

Finally, we review related work on structure learning approaches for undirected graphical models, which underpin the SRL methods we highlight in this paper. We also provide an overview of work in relational information retrieval which motivates our path-constrained clause generation.

For general Markov random fields (MRFs) and their conditional variants, structure learning typically induces feature functions represented as propositional logical clauses of Boolean attributes [McCallum2002, Davis and Domingos2010]. An approximate model score is optimized with a greedy search that iteratively picks clausal feature functions to include, while refining candidate features by adding, removing or negating literals, starting from single-literal clauses. MRF structure learning is also viewed as a feature selection problem solved by performing L1-regularized optimization over candidate features, admitting fast gradient descent and online algorithms [Perkins, Lacker, and Theiler2003, Zhu, Lao, and Xing2010].

Although structure learning has not been studied in PSL, many algorithms have been proposed to learn MLNs. The initial approach to MLN structure learning performs greedy beam search to grow the set of model clauses starting from single-literal clauses. The clause generation performs all possible negations and additions to an existing set of clauses while the search procedure iteratively selects clauses to refine. To efficiently guide the search towards useful models, bottom-up approaches generate informative clauses by using relational paths to capture patterns and motifs in the data [Mihalkova and Mooney2007, Kok and Domingos2009, Kok and Domingos2010]. This relational path mining in bottom-up approaches is related to the path ranking algorithm (PRA) for relational information retrieval [Lao and Cohen2010]. PRA performs random walks or breadth-first traversal on relational data to find useful path-based features for retrieval tasks [Lao and Cohen2010, Gardner et al.2013, Gardner et al.2014].

Most recently, MLN structure learning has been viewed from the perspectives of moralizing learned Bayesian networks [Khosravi et al.2010] and functional gradient boosting [Khot et al.2011, Khot et al.2015]. These methods improve scalability while maintaining predictive performance. Alternatively, approaches have been proposed to learn MLNs for target variables specific to a task of interest, as we do for PSL. Structure learning methods for particular tasks use inductive logic programming [Muggleton1991] to generate clauses, which are pruned with L1-regularized learning [Huynh and Mooney2008, Huynh and Mooney2011], or perform iterative local search [Biba, Ferilli, and Esposito2008] to refine rules with the operations described above.

7 Conclusion and Future Work

In this work, we formalize the structure learning problem for PSL and introduce an efficient-to-optimize and convex surrogate objective function, PPLL. We unify scalable optimization with data-driven path-constrained clause generation. Compared to the straightforward but inefficient greedy local search method, PPLL remains scalable as the space of candidate rules grows and demonstrates good predictive performance across five real-world tasks. Although we focus on PSL in this work, our PPLL method can be generalized for MLNs and other SRL frameworks. An important line of future work for PSL structure learning is extending L1-regularized feature selection and functional gradient boosting approaches which have been applied successfully to MRFs and MLNs. These methods have been shown to scale while maintaining good predictive performance.

Acknowledgements

This work is sponsored by the Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA), and supported by NSF grants CCF-1740850 and NSF IIS-1703331. We thank Sriraam Natarajan and Devendra Singh Dhami for sharing their DrugBank dataset.

References

  • [Bach et al.2017] Bach, S. H.; Broecheler, M.; Huang, B.; and Getoor, L. 2017. Hinge-loss Markov random fields and probabilistic soft logic. Journal of Machine Learning Research 18(109):1–67.
  • [Beltagy, Erk, and Mooney2014] Beltagy, I.; Erk, K.; and Mooney, R. J. 2014. Probabilistic soft logic for semantic textual similarity. In ACL.
  • [Besag1975] Besag, J. 1975. Statistical analysis of non-lattice data. The statistician 179–195.
  • [Biba, Ferilli, and Esposito2008] Biba, M.; Ferilli, S.; and Esposito, F. 2008. Discriminative structure learning of Markov logic networks. In ILP.
  • [Davis and Domingos2010] Davis, J., and Domingos, P. 2010. Bottom-up learning of Markov network structure. In ICML.
  • [Ebrahimi, Dou, and Lowd2016] Ebrahimi, J.; Dou, D.; and Lowd, D. 2016. Weakly supervised tweet stance classification by relational bootstrapping. In EMNLP.
  • [Gardner et al.2013] Gardner, M.; Talukdar, P. P.; Kisiel, B.; and Mitchell, T. 2013. Improving learning and inference in a large knowledge-base using latent syntactic cues.
  • [Gardner et al.2014] Gardner, M.; Talukdar, P. P.; Krishnamurthy, J.; and Mitchell, T. 2014. Incorporating vector space similarity in random walk inference over knowledge bases. In EMNLP.
  • [Huynh and Mooney2008] Huynh, T. N., and Mooney, R. J. 2008. Discriminative structure and parameter learning for Markov logic networks. In ICML.
  • [Huynh and Mooney2011] Huynh, T. N., and Mooney, R. J. 2011. Online structure learning for Markov logic networks. In ECML-PKDD.
  • [Johnson and Goldwasser2016] Johnson, K., and Goldwasser, D. 2016. “All I know about politics is what I read in Twitter”: Weakly supervised models for extracting politicians’ stances from Twitter. In COLING.
  • [Kazemi and Poole2018] Kazemi, S. M., and Poole, D. 2018. Bridging weighted rules and graph random walks for statistical relational models. Frontiers in Robotics and AI 5:8.
  • [Khosravi et al.2010] Khosravi, H.; Schulte, O.; Man, T.; Xu, X.; and Bina, B. 2010. Structure learning for Markov logic networks with many descriptive attributes. In AAAI.
  • [Khot et al.2011] Khot, T.; Natarajan, S.; Kersting, K.; and Shavlik, J. 2011. Learning Markov logic networks via functional gradient boosting. In ICDM.
  • [Khot et al.2015] Khot, T.; Natarajan, S.; Kersting, K.; and Shavlik, J. 2015. Gradient-based boosting for statistical relational learning: the Markov logic network and missing data cases. Machine Learning 100(1):75–100.
  • [Kok and Domingos2005] Kok, S., and Domingos, P. 2005. Learning the structure of Markov logic networks. In ICML.
  • [Kok and Domingos2009] Kok, S., and Domingos, P. 2009. Learning Markov logic network structure via hypergraph lifting. In ICML.
  • [Kok and Domingos2010] Kok, S., and Domingos, P. 2010. Learning Markov logic networks using structural motifs. In ICML.
  • [Lao and Cohen2010] Lao, N., and Cohen, W. W. 2010. Relational retrieval using a combination of path-constrained random walks. Machine learning 81(1):53–67.
  • [Lao, Mitchell, and Cohen2011] Lao, N.; Mitchell, T.; and Cohen, W. W. 2011. Random walk inference and learning in a large scale knowledge base. In EMNLP.
  • [McCallum2002] McCallum, A. 2002. Efficiently inducing features of conditional random fields. In UAI.
  • [Mihalkova and Mooney2007] Mihalkova, L., and Mooney, R. J. 2007. Bottom-up learning of Markov logic network structure. In ICML.
  • [Muggleton1991] Muggleton, S. 1991. Inductive logic programming. New generation computing 8(4):295–318.
  • [Perkins, Lacker, and Theiler2003] Perkins, S.; Lacker, K.; and Theiler, J. 2003. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research 3(Mar):1333–1356.
  • [Platanios et al.2017] Platanios, E.; Poon, H.; Mitchell, T. M.; and Horvitz, E. J. 2017. Estimating accuracy from unlabeled data: A probabilistic logic approach. In NIPS.
  • [Richardson and Domingos2006] Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine learning 62(1-2):107–136.
  • [Sutton and McCallum2007] Sutton, C., and McCallum, A. 2007. Piecewise pseudolikelihood for efficient training of conditional random fields. In ICML.
  • [Zhu, Lao, and Xing2010] Zhu, J.; Lao, N.; and Xing, E. P. 2010. Grafting-light: Fast, incremental feature selection and structure learning of Markov random fields. In KDD.