Stacked Structure Learning for Lifted Relational Neural Networks

10/05/2017 ∙ by Gustav Sourek, et al. ∙ Czech Technical University in Prague ∙ Cardiff University

Lifted Relational Neural Networks (LRNNs) describe relational domains using weighted first-order rules which act as templates for constructing feed-forward neural networks. While previous work has shown that using LRNNs can lead to state-of-the-art results in various ILP tasks, these results depended on hand-crafted rules. In this paper, we extend the framework of LRNNs with structure learning, thus enabling a fully automated learning process. Similarly to many ILP methods, our structure learning algorithm proceeds in an iterative fashion by top-down searching through the hypothesis space of all possible Horn clauses, considering the predicates that occur in the training examples as well as invented soft concepts entailed by the best weighted rules found so far. In the experiments, we demonstrate the ability to automatically induce useful hierarchical soft concepts leading to deep LRNNs with a competitive predictive power.

1 Introduction

Lifted Relational Neural Networks (LRNNs [15]) are weighted sets of first-order rules, which are used to construct feed-forward neural networks from relational structures. A central characteristic of LRNNs is that a different neural network is constructed for each learning example, but crucially, the weights of these different neural networks are shared. This allows LRNNs to use neural networks for learning in relational domains, despite the fact that training examples may vary considerably in size and structure.

In previous work, LRNNs have been learned from hand-crafted rules. In such cases, only the weights of the first-order rules have to be learned from training data, which can be accomplished using a variant of back-propagation. The use of hand-crafted rules offers a natural way to incorporate domain knowledge in the learning process. In some applications, however, (sufficient) domain knowledge is lacking and both the rules and their weights have to be learned from data. To this end, in this paper we introduce a structure learning method for LRNNs.

Our proposed structure learning method proceeds in an iterative fashion. In each iteration, it may either learn a set of rules that intuitively correspond to a new layer of a neural network template, or learn a set of rules that intuitively correspond to creating new connections among existing layers, a strategy which we refer to as stacked structure learning. The rules that are added in a given iteration either define one of the target predicates, or they define a new predicate that may depend on predicates that were ‘invented’ at earlier layers as well as on predicates from the considered domain. Since the actual meaning of these predicates depends on both the learned rules and their associated weights, structure learning is alternated with weight learning. Intuitively, this means that the definitions of predicates defined in earlier layers can be fine-tuned based on the rules which are added to later layers.

We present experimental results which show that the resulting LRNNs perform comparably to LRNNs that have been learned from hand-crafted rules. We believe that this makes LRNNs a particularly convenient framework for learning in relational domains, without any need for prior knowledge or extensive hyperparameter tuning. Somewhat surprisingly, we find that LRNNs with learned rules are often more compact than those with hand-crafted rules.

The remainder of the paper is structured as follows. In the next section, we first provide the required background on LRNNs. In Section 3, we then present the proposed structure learning method, after which we discuss our experimental results in Section 4.

2 Preliminaries

In this section, we briefly recall the LRNN framework from [15].

LRNN Structure.

A lifted relational neural network (LRNN) $\mathcal{N}$ is a set of weighted definite clauses, i.e. a set of pairs $(R_i, w_i)$ where $R_i$ is a definite clause and $w_i \in \mathbb{R}$ is its weight. For a LRNN $\mathcal{N}$, we write $\mathcal{N}^*$ to denote the corresponding set of definite clauses, i.e. $\mathcal{N}^* = \{R_i : (R_i, w_i) \in \mathcal{N}\}$. The grounding $\overline{\mathcal{N}}$ of a LRNN $\mathcal{N}$ consists of the ground instances of the weighted clauses in $\mathcal{N}$, restricted to those clauses that correspond to active rules, i.e. rules whose antecedent is satisfied in the least Herbrand model of $\mathcal{N}^*$. The neural network corresponding to $\overline{\mathcal{N}}$ contains the following types of neurons:

  • For each ground atom $h$ occurring in $\overline{\mathcal{N}}$, there is a neuron $A_h$, called an atom neuron.

  • For each ground fact in $\overline{\mathcal{N}}$, there is a neuron, called a fact neuron.

  • For every ground rule in $\overline{\mathcal{N}}$, there is a neuron, called a rule neuron.

  • For every (possibly non-ground) rule in $\mathcal{N}$ and every grounding of its head that occurs in $\overline{\mathcal{N}}$, there is a neuron, called an aggregation neuron.

Forward propagation.

Intuitively, the neural network computes for each ground atom $h$ a truth value, which is given by the output of the atom neuron $A_h$. To obtain these truth values, the network propagates values in a way which closely mimics the immediate consequence operator from logic programming. In particular, when using the immediate consequence operator, there are two ways in which $h$ can become true: if $h$ corresponds to a fact, or if $h$ is the head of a rule whose body is already satisfied. Similarly, the inputs of the atom neuron $A_h$ consist of the fact neurons and the aggregation neurons associated with $h$. The output of an atom neuron with inputs $i_1, \dots, i_m$ is given by $g_\vee(i_1, \dots, i_m)$, where $g_\vee$ is an activation function that maps the inputs to a real-valued output. In this paper we use a sigmoidal activation function, built on the sigmoid function $\mathrm{sigm}(x) = 1/(1+e^{-x})$, whose two parameters are set so that $g_\vee$ closely approximates the Łukasiewicz fuzzy disjunction [7] (see right panel in Figure 1). This helps with the interpretability of LRNNs, as it means that we can intuitively think of the activation functions as logical connectives, and of LRNNs as (fuzzy) logic programs.
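The crisp propagation that the network mimics can be sketched in a few lines. The following is a minimal illustration, assuming ground definite clauses given as `(head, body)` pairs; the function name and the molecule-style atoms are hypothetical and are not taken from the paper. It applies the rules repeatedly until no new atom can be derived, which yields the least Herbrand model.

```python
def least_herbrand_model(facts, rules):
    """Apply ground definite clauses until a fixpoint is reached.

    facts: iterable of ground atoms (strings)
    rules: iterable of (head, [body atoms]) pairs
    """
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # A head becomes true once every body atom is already true.
            if head not in model and all(atom in model for atom in body):
                model.add(head)
                changed = True
    return model


# Hypothetical toy example (predicate names are illustrative only).
facts = {"bond(a,b)", "carbon(a)", "hydrogen(b)"}
rules = [
    ("ch_pair(a,b)", ["carbon(a)", "bond(a,b)", "hydrogen(b)"]),
    ("active(a)", ["ch_pair(a,b)"]),
    ("active(b)", ["oxygen(b)"]),  # body is never satisfied
]
model = least_herbrand_model(facts, rules)
```

In this toy run only the first two rules are active in the sense used above, so only they would give rise to rule neurons in the ground network.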

A fact neuron has no input and has the weight of the corresponding fact as its output. The output of an aggregation neuron intuitively expresses how strongly the corresponding ground head atom can be derived using the corresponding rule. The inputs of the aggregation neuron are the rule neurons of all groundings of that rule which have this atom as their head. The output of the aggregation neuron is computed from its inputs $i_1, \dots, i_m$, an activation function $g_*$, and the weight $w$ of the corresponding rule.

A rule neuron intuitively needs to fire if the atoms in the body of the corresponding ground rule are all true. Accordingly, its inputs $i_1, \dots, i_k$ are given by the atom neurons of these body atoms, and its output is $g_\wedge(i_1, \dots, i_k)$, with $g_\wedge$ a third type of activation function. In this paper we use a sigmoidal activation function whose two parameters are set so that $g_\wedge$ approximates the Łukasiewicz fuzzy conjunction [7] (see left panel in Figure 1).

Figure 1: An approximation of the Łukasiewicz conjunction (left) and disjunction (right) by the sigmoidal activation functions used in LRNNs.
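To make the idea of approximating fuzzy connectives with sigmoids concrete, here is a minimal sketch. The Łukasiewicz operators are standard; the slope and offset of the sigmoidal versions (`SLOPE`, `OFFSET`) are illustrative assumptions chosen for this example and should not be read as the parameter values used in the paper.

```python
import math


def luk_and(*xs):
    """Lukasiewicz conjunction, extended to k arguments."""
    return max(0.0, sum(xs) - (len(xs) - 1))


def luk_or(*xs):
    """Lukasiewicz disjunction, extended to k arguments."""
    return min(1.0, sum(xs))


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


# ASSUMPTION: illustrative parameter values, not taken from the paper.
SLOPE, OFFSET = 6.0, 0.5


def soft_or(*xs):
    """Sigmoidal stand-in for the disjunction."""
    return sigm(SLOPE * (sum(xs) - OFFSET))


def soft_and(*xs):
    """Sigmoidal stand-in for the conjunction."""
    return sigm(SLOPE * (sum(xs) - len(xs) + OFFSET))


# Largest gap between the crisp and the sigmoidal versions on Boolean inputs.
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
max_gap = max(
    max(abs(soft_or(a, b) - luk_or(a, b)), abs(soft_and(a, b) - luk_and(a, b)))
    for a, b in corners
)
```

With these assumed parameters the sigmoidal functions stay close to the crisp connectives on Boolean inputs while remaining differentiable everywhere, which is what gradient-based weight learning needs.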

Weight learning.

In applications, we usually consider LRNNs of the form $\mathcal{N} \cup \mathcal{E}$, where $\mathcal{N}$ is a weighted set of first-order rules and $\mathcal{E}$ is a weighted set of ground facts. In particular, each $\mathcal{E}$ represents an example, while $\mathcal{N}$ acts as a template for constructing feed-forward neural networks, with the ground neural network of $\mathcal{N} \cup \mathcal{E}$ being the network corresponding to example $\mathcal{E}$. While the weights of $\mathcal{E}$ are given, the weights of $\mathcal{N}$ typically need to be learned from training data, as follows.

We are given a list of examples $\mathcal{E}_1, \dots, \mathcal{E}_m$, where each $\mathcal{E}_j$ is a LRNN (typically containing only weighted ground facts), and a list of training queries $(q_1, t_1), \dots, (q_m, t_m)$, where each $q_j$ is a ground atom, which we call a training query atom, and $t_j$ is its target value. For a query atom $q_j$, let $y_j$ denote the output of the atom neuron $A_{q_j}$ in the ground neural network of $\mathcal{N} \cup \mathcal{E}_j$. The goal of the learning process is to find the weights of the rules (and possibly facts) in $\mathcal{N}$ for which the loss on the training query atoms, which compares each output $y_j$ with its target value $t_j$, is minimized. This loss function is optimized using a standard stochastic gradient descent algorithm [2]. For details about weight learning of LRNNs, see [15].
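The key point of weight learning, namely that the same weights are updated from examples of different sizes, can be illustrated with a deliberately tiny stand-in. This sketch is not the back-propagation variant of [15]: it assumes a hypothetical one-rule "template" whose output is a sigmoid of a shared weight `w` times the average of an example's fact values plus a shared offset `b`, and it minimises a squared loss with plain stochastic gradient descent.

```python
import math
import random


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w, b, facts):
    """Output for one example; examples may have different numbers of facts."""
    return sigm(w * sum(facts) / len(facts) + b)


def squared_loss(w, b, examples, targets):
    return sum((forward(w, b, e) - t) ** 2 for e, t in zip(examples, targets))


def train(examples, targets, lr=0.5, epochs=200, seed=0):
    """Plain SGD on the squared loss; w and b are shared by all examples."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for j in order:
            x = sum(examples[j]) / len(examples[j])
            y = sigm(w * x + b)
            # Derivative of (y - t)^2 with respect to the pre-activation.
            delta = 2.0 * (y - targets[j]) * y * (1.0 - y)
            w -= lr * delta * x
            b -= lr * delta
    return w, b


# Hypothetical examples of different sizes with target values in {0, 1}.
examples = [[0.9, 0.8], [0.1], [0.7, 0.9, 1.0], [0.2, 0.0, 0.1, 0.3]]
targets = [1.0, 0.0, 1.0, 0.0]
loss_before = squared_loss(0.0, 0.0, examples, targets)
w, b = train(examples, targets)
loss_after = squared_loss(w, b, examples, targets)
```

Each example induces its own small computation, but every update touches the same two parameters, which mirrors how a single template is trained across differently shaped ground networks.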

3 Structure Learning

In this section we describe a structure learning algorithm for LRNNs. The algorithm receives a list of training examples and a list of training queries, and it produces a LRNN. For simplicity, we will assume that constants are only used as identifiers of objects. In particular, we will assume that attribute values are represented using unary literals, i.e. an attribute value is expressed by a unary predicate applied to the object rather than by a constant appearing as an argument. Besides that, we do not put any restrictions on the structure of the training examples.

3.1 Structure of the Learned LRNNs

The structure learning algorithm will create LRNNs having a generic “stacked” structure which we now describe. First, there are rules that define new predicates, representing soft clusters [17] of unary predicates from the dataset. These can be thought of as the first layer of the LRNN, where the weighted facts from the dataset comprise the zeroth layer. For instance, if the unary predicates in the dataset are $A_1, \dots, A_n$, then the LRNN will contain one weighted rule for each latent predicate $x^1_j$ and each unary predicate $A_i$:

$w_{ij} : x^1_j(X) \leftarrow A_i(X)$

Here each $x^1_j$ is a latent predicate representing a soft cluster, the superscript denotes the layer in which it appears (in this case, the first layer) and the subscript $j$ indexes the individual soft clusters in that layer.
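As a rough illustration of what such a first layer looks like, the sketch below generates one weighted rule per pair of soft cluster and unary predicate. The textual rule syntax, the `x1_j` naming scheme and the uniform random initialisation are assumptions made for this example only; they are not the format of any actual LRNN implementation.

```python
import random


def first_layer_rules(unary_predicates, num_clusters, seed=0):
    """Return (weight, rule) pairs defining soft clusters over unary predicates."""
    rng = random.Random(seed)
    rules = []
    for j in range(1, num_clusters + 1):
        for pred in unary_predicates:
            weight = rng.uniform(-1.0, 1.0)  # assumed random initialisation
            rules.append((weight, f"x1_{j}(X) :- {pred}(X)"))
    return rules


# Hypothetical atom types and two soft clusters.
rules = first_layer_rules(["carbon", "hydrogen", "oxygen"], num_clusters=2)
rule_texts = [text for _, text in rules]
```

Read column-wise, the weights attached to a single unary predicate across all clusters form a vector with one coordinate per cluster.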

In general, the second layer will consist of two types of rules. First, there may be rules introducing new latent predicates. In contrast to the unary predicates that were introduced in the first layer, here the latent predicates could also be of higher arity, although in practice an upper bound on the arity will be imposed for efficiency reasons. In the body of these rules, we may find predicates from the dataset itself, or latent predicates that were introduced in the first layer. The new latent predicates introduced in these rules may then be used in the bodies of rules in subsequent layers. Second, there may also be rules that have a predicate from the dataset in their head. These will typically be rules that were learned to predict the target predicates that we want to learn.

Example 1.

For instance, in datasets of molecules, unary predicates can be used to represent types of atoms, such as carbon or hydrogen. A second layer rule may have one of the predicates from the dataset in its head. Second layer rules may also introduce a new latent predicate, for example a binary latent predicate defined by two weighted rules. The actual intuitive meaning of such a predicate will depend on the weights of these two rules. For instance, if both are large enough, the (atom neurons corresponding to the) predicate will have high output whenever its arguments correspond to two atoms which are either one or two steps apart from each other in the molecule, and which have sufficiently high membership in a given first-layer soft cluster.

Any higher layers have a similar structure to the second layer, where the $k$-th layer contains rules whose bodies only contain predicates from layers 0 to $k-1$, and whose heads either contain a target predicate or introduce a new latent predicate.

3.2 Structure Learning Algorithm

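1: Input: learning examples
2: Input: dimension of the latent concepts
3:
4: [construct the first layer: rules defining the soft-cluster predicates]
5: [construct the first layer, continued]
6: [fix the weights of all rules defining latent predicates]
7: while [stopping condition not met] do
8:      [beam search for the best rule]
9:      [add the related rules with latent predicates in their heads]
10:
11:     [retrain the weights of all rules by stochastic gradient descent]
12:
13: end while
14: return [the learned LRNN]
Algorithm 1 General schema of structure learning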

The structure learning algorithm (Algorithm 1) iteratively constructs LRNNs that have the structure described in the previous section. It alternates weight learning steps with rule learning steps (variants of this strategy are employed by many structure learning algorithms in the context of statistical relational learning, e.g. [4, 8, 5]). In the weight learning steps, the algorithm uses stochastic gradient descent to minimise the squared loss of the LRNN by optimising the weights of the rules, as described in Section 2. In the rule learning steps, the algorithm fixes the weights of all rules which define latent predicates and searches for a good rule $R$. This rule should be such that the squared loss of the LRNN decreases after we add $R$ to it and retrain the weights of all rules with non-latent head predicates. Next we describe this algorithm in detail.

The first step of the structure learning algorithm (lines 4–5) is the construction of the first level of the LRNN, which defines the unary predicates representing soft clusters of object properties, as described in Section 3.1.

After the first step, the algorithm repeats the following procedure for a given number of iterations or until no suitable rules can be found anymore. It fixes the weights of all rules defining latent predicates (line 6). Then it runs a beam search algorithm searching through the space of possible rules (line 8); this space is defined by two user-specified constraints, the maximum rule length and the maximum number of variables in a rule. The scoring function used by the beam search is computed as follows. Given a candidate rule $R$, the algorithm creates a copy of the current LRNN to which $R$ is added. It then optimises the log-loss of this new LRNN (which corresponds to maximum-likelihood estimation for logistic regression), training just the non-fixed weights, i.e. the weights of the rules with non-latent predicates in their heads. The score of the rule $R$ is then defined to be the log-loss after training the non-fixed weights. We do not retrain all weights of the LRNN when scoring a rule for efficiency reasons, because training the weights of the whole LRNN corresponds to training a deep neural network. After the beam search algorithm finishes, the rule that it returned is added to the original LRNN.
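The search step described above can be sketched generically as follows. This is a minimal beam search over rule bodies, assuming a caller-supplied refinement operator and scoring function where lower scores are better. The literal names, the `refine` operator and especially `toy_score` are hypothetical: in the actual algorithm the score of a candidate is the log-loss after retraining the non-fixed weights, which is replaced here by a cheap stand-in so that the example is self-contained.

```python
def beam_search(start, refine, score, beam_width=2, max_length=3):
    """Return the best (lowest-scoring) rule body found and its score."""
    best, best_score = start, score(start)
    beam = [start]
    for _ in range(max_length):  # each step adds one literal to the body
        candidates = {c for body in beam for c in refine(body)}
        if not candidates:
            break
        # Rank by score; break ties deterministically by the body itself.
        ranked = sorted(candidates, key=lambda c: (score(c), c))
        beam = ranked[:beam_width]
        if score(beam[0]) < best_score:
            best, best_score = beam[0], score(beam[0])
    return best, best_score


# Hypothetical literals available to the refinement operator.
LITERALS = ("bond(X,Y)", "x1_1(X)", "x1_1(Y)", "x1_2(X)", "x1_2(Y)")
# Hypothetical "ideal" body, used only by the stand-in score below.
TARGET = frozenset({"bond(X,Y)", "x1_1(X)", "x1_2(Y)"})


def refine(body):
    """Extend a body (a sorted tuple of literals) by one unused literal."""
    return [tuple(sorted(set(body) | {lit})) for lit in LITERALS if lit not in body]


def toy_score(body):
    """Stand-in for the log-loss after retraining: distance to TARGET."""
    return len(TARGET.symmetric_difference(body))


best, best_score = beam_search((), refine, toy_score)
```

The `max_length` argument plays the role of the maximum rule length constraint; a bound on the number of variables would be enforced inside the refinement operator.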

Note that the rule $R$ returned by the beam search contains one of the target predicates in its head. However, in addition to adding $R$, we also add a set of related rules that have latent predicates in their head (line 9), as follows. Here, we will assume for simplicity that all latent predicates have the same arity, but the same method can still be used when the latent predicates are allowed to have different arities. Let $k$ be the highest index such that $R$ contains a latent predicate of the form $x^k_j$ (i.e. a latent predicate from layer $k$) in its body, where we assume $k = 0$ if $R$ does not contain any latent predicates. Then for each latent predicate $x^{k+1}_j$ from the $(k+1)$-th layer, the algorithm adds to the LRNN all rules which have $x^{k+1}_j$ in the head and which can be obtained by unifying the arguments of $x^{k+1}_j$ with the variables in $R$. This process is illustrated in the following example.

Example 2.

Revisiting the example of molecular datasets, let and let . Then the algorithm will add the following latent-predicate rules:

Note that the algorithm has to add the new rules one layer above the highest layer whose latent predicates already occur in the body of the learned rule.

After the LRNN has been extended by all these rules obtained from the learned rule, the weights of all the rules, including those corresponding to latent predicates, are retrained using stochastic gradient descent (line 11). Note that typically there will be some latent predicates which are not used in any rules; their weights are not considered during training. Subsequently, the algorithm again fixes the weights of the rules corresponding to the latent predicates, and repeats the same process to find an additional rule. This is repeated until a given stopping condition is met.

4 Experiments

In this section we describe the results of experiments performed with the structure learning algorithm on real-life molecular datasets. We performed experiments on 72 NCI datasets [13], each of which contains several thousands of molecules, labeled by their ability to inhibit the growth of different types of tumors. We compare the performance of the proposed LRNN structure learning method with the best previously published LRNNs, which contain large generic, yet manually constructed weighted rule sets [15]. For further comparison we include the relational learners kFOIL [10] and nFOIL [9], which respectively combine relational rule learning with support vector machines and with naive Bayes learning.

The results are shown in Figure 2 and Figure 3. The automatically learned LRNNs outperform both kFOIL and nFOIL in terms of predictive accuracy (measured using cross-validation). The learned LRNNs are also competitive with the manually constructed LRNNs from [16, 15], although they do not outperform them. They are slightly worse than the largest of the manually constructed LRNNs, which is based on graph patterns with 3 vertices, enumerating all possible combinations of soft cluster types of the three atoms and soft cluster types of the two bonds connecting them. Figure 4 displays statistics of the learned LRNN rule sets. These statistics show that the structure learner produced quite complex LRNNs having multiple layers of invented latent predicates.

Figure 2: Comparison of cross-validated test errors of LRNNs produced by structure learning with the nFOIL and kFOIL learners as baselines.
Figure 3: Comparison of test errors of LRNNs produced automatically by structure learning with 3 handcrafted LRNNs with varying lengths of chain patterns from [15].
Figure 4: Statistics of the learned LRNN rule sets from experiments with the 72 NCI datasets. We display (i) the number of rules (including zeroth layer soft clusters), (ii) the number of conjunctive rules (patterns) learned, (iii) the average length of these rules (patterns), and (iv) the overall number of layers (depth of template).

The weights of the rules defining the latent predicates in the first layer of the LRNN can be interpreted as coordinates of a vector-space embedding of the properties (atom types in our case). In Figure 5, we plot the evolution of these embeddings as new rules are being added by the structure learning algorithm. The left panel of Figure 5 displays the evolution of the embeddings of atom types after these have been pre-trained using an unsupervised method which was originally used for statistical predicate invention in [17]. The right panel of the same figure displays the evolution of the embeddings when starting from random initialization without any unsupervised pre-training. What can be seen from these figures is how, as the model becomes more complex, the atom types start to make more visible clusters. Interestingly and perhaps somewhat against intuition, the use of the unsupervised pre-training seemed to consistently decrease predictive performance (we omit details due to limited space).

Figure 5: PCA projection of evolution of atom embeddings during first 6 iterations (denoted by colors) of structure learning of a LRNN, with initialization based on unsupervised pre-training (left) and with completely random initialization (right).

5 Related Work

LRNNs are related to many older works on using neural networks for relational learning such as [1] and more recent approaches such as [14, 3]. The structure learning strategy that we employ in this paper is in many respects similar to structure learning methods from statistical relational learning such as [4, 8, 5]. However, what clearly distinguishes it from all these previous SRL approaches is its ability to automatically induce hierarchies of latent concepts. In this respect, it is also related to meta-interpretive learning [11]. However, meta-interpretive learning is only applicable to the learning of crisp logic programs. The structure learning approach is also related to works on refining architectures of neural networks [6, 12]. However, it differs from these in its ability to handle relational data.

6 Conclusions and Future Work

In this paper we have introduced a method for learning the structure of LRNNs, capable of learning deep weighted rule sets with invented latent predicates. The predictive accuracies obtained by the learned LRNNs were competitive with results that we obtained in our previous work using manually constructed LRNNs. The method presented in this paper therefore has the potential to make LRNNs useful in domains where it would otherwise be difficult to come up with a rule set manually. It also makes the adoption of LRNNs by non-expert users more straightforward, as the proposed method can learn competitive LRNNs without requiring any user input (besides the dataset).

Acknowledgements GŠ, MS and FŽ acknowledge support by project no. 17-26999S granted by the Czech Science Foundation. OK is supported by a grant from the Leverhulme Trust (RPG-2014-164). SS is supported by ERC Starting Grant 637277. Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures”.

References

  • [1] Blockeel, H., Uwents, W.: Using neural networks for relational learning. In: ICML-2004 Workshop on Statistical Relational Learning and its Connection to Other Fields, pp. 23–28 (2004)
  • [2] Bottou, L.: Stochastic gradient descent tricks. In: Neural networks: Tricks of the trade, pp. 421–436. Springer (2012)
  • [3] Cohen, W.W.: Tensorlog: A differentiable deductive database. arXiv preprint arXiv:1605.06523 (2016)
  • [4] Davis, J., Burnside, E.S., de Castro Dutra, I., Page, D., Costa, V.S.: An integrated approach to learning bayesian networks of rules. In: Proceedings of the 16th European Conference on Machine Learning, pp. 84–95 (2005)
  • [5] Dinh, Q.T., Exbrayat, M., Vrain, C.: Generative structure learning for markov logic networks based on graph of predicates. In: IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, p. 1249 (2011)
  • [6] Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture (1989)
  • [7] Hájek, P.: Metamathematics of fuzzy logic, vol. 4. Springer Science & Business Media (1998)
  • [8] Kok, S., Domingos, P.: Learning the structure of markov logic networks. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 441–448 (2005)
  • [9] Landwehr, N., Kersting, K., Raedt, L.D.: Integrating naive bayes and foil. The Journal of Machine Learning Research 8, 481–507 (2007)
  • [10] Landwehr, N., Passerini, A., De Raedt, L., Frasconi, P.: kFOIL: learning simple relational kernels. In: AAAI’06: Proceedings of the 21st national conference on Artificial intelligence, pp. 389–394. AAAI Press (2006)
  • [11] Muggleton, S.H., Lin, D., Tamaddoni-Nezhad, A.: Meta-interpretive learning of higher-order dyadic datalog: predicate invention revisited. Machine Learning 100(1), 49–73 (2015)
  • [12] Opitz, D.W., Shavlik, J.W.: Heuristically expanding knowledge-based neural networks. In: IJCAI, pp. 1360–1365 (1993)
  • [13] Ralaivola, L., Swamidass, S.J., Saigo, H., Baldi, P.: Graph kernels for chemical informatics. Neural Netw. 18(8), 1093–1110 (2005)
  • [14] Rocktäschel, T., Riedel, S.: Learning knowledge base inference with neural theorem provers. In: NAACL Workshop on Automated Knowledge Base Construction (AKBC) (2016)
  • [15] Šourek, G., Aschenbrenner, V., Železný, F., Kuželka, O.: Lifted relational neural networks. In: Proceedings of the NIPS Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches (2015)
  • [16] Šourek, G., Aschenbrenner, V., Železný, F., Kuželka, O.: Lifted Relational Neural Networks. arXiv preprint (2015). URL http://arxiv.org/abs/1508.05128
  • [17] Šourek, G., Manandhar, S., Železný, F., Schockaert, S., Kuželka, O.: Learning predictive categories using lifted relational neural networks. In: ILP’16, 26th International Conference on Inductive Logic Programming (2016)