# Can Graph Neural Networks Help Logic Reasoning?

Effectively combining logic reasoning and probabilistic inference has been a long-standing goal of machine learning: the former can generalize from small training data, while the latter provides a principled framework for dealing with noisy data. However, existing methods for combining the best of both worlds are typically computationally intensive. In this paper, we focus on Markov Logic Networks and explore the use of graph neural networks (GNNs) for representing probabilistic logic inference. Our analysis reveals that the representation power of GNNs alone is not enough for this task. We instead propose a more expressive variant, called ExpressGNN, which can perform effective probabilistic logic inference while scaling to a large number of entities. We demonstrate on several benchmark datasets that ExpressGNN has the potential to advance probabilistic logic reasoning to the next stage.


## 1 Introduction

An elegant framework for combining logic reasoning and probabilistic inference is the Markov Logic Network (MLN) [1], where logic predicates are treated as random variables and logic formulae are used to define the potential functions. It has greatly extended the ability of logic reasoning to handle the noisy facts and partially correct logic formulae common in real-world problems. Furthermore, MLN enables probabilistic graphical models to exploit prior knowledge and learn in the regime of small or zero training samples. This second aspect is important in the context of lifelong learning and massive multitask learning, where most prediction targets have an insufficient number of labeled examples.

However, a central challenge is that probabilistic inference in MLN is computationally intensive: the grounded network contains $O(M^d)$ random variables if there are $M$ entities and the involved predicates have at most $d$ arguments. Approximate inference techniques such as MCMC and belief propagation have been proposed, but the size of the grounded MLN makes them barely scalable to hundreds of entities.

Graph neural networks (GNNs) are a popular tool for learning representations of graph data, including but not limited to social networks, molecular graphs, and knowledge graphs [2, 3, 4, 5, 6, 7]. It is natural to think that GNNs have the potential to improve the effectiveness of probabilistic logic inference in MLN. However, it is not clear why and how exactly GNNs may help.

In this paper, we explore the use of GNNs for scalable probabilistic logic inference in MLN, and provide an affirmative answer on how to do so. In our method, a GNN is applied to the knowledge base, which can be orders of magnitude smaller than the grounded MLN, and the GNN embeddings are then used to define mean field distributions in probabilistic logic inference. However, our analysis reveals that GNN embeddings alone lead to an inconsistent parametrization due to the additional asymmetry created by logic formulae. Motivated by this analysis, we propose a more expressive variant, called ExpressGNN, which consists of (1) an inductive GNN embedding component that learns representations from knowledge bases; and (2) a transductive, tunable embedding component that compensates for the asymmetry created by logic formulae in MLN.

We show by experiments that mean field approximation with ExpressGNN enables efficient and effective probabilistic logic inference in modern knowledge bases. Furthermore, ExpressGNN can achieve these results with far fewer parameters than purely transductive embeddings, and yet it has the ability to adapt and generalize to new knowledge graphs.

Related work. Previous probabilistic logic inference techniques use either sampling methods or belief propagation. More advanced variants exploit symmetries in MLNs to reduce computation (e.g., the lifted inference algorithms [8, 9]). However, these inference methods still barely scale to hundreds of entities. We describe specific methods and compare their performance in Section 7. A recent seminal work explored the use of GNNs for relation prediction, but it does not consider the additional challenges posed by logic formulae [10].

## 2 Knowledge Bases and Markov Logic Networks

Knowledge base. Typically, a knowledge base consists of a tuple $\mathcal{K} = (\mathcal{C}, \mathcal{R}, \mathcal{O})$, with a set $\mathcal{C}$ of entities, a set $\mathcal{R}$ of relations, and a collection $\mathcal{O}$ of observed facts. In the language of first-order logic, entities are also called constants. For instance, a constant can be a person or an object. Relations are also called predicates. Each predicate is a logic function defined over $\mathcal{C}$, i.e., $r(\cdot) : \mathcal{C} \times \cdots \times \mathcal{C} \mapsto \{0, 1\}$. In general, the arguments of predicates are asymmetric. For instance, for the predicate $L(c, c')$ (L for Like) which checks whether $c$ likes $c'$, the arguments $c$ and $c'$ are not exchangeable.

With a particular set of entities assigned to its arguments, a predicate is called a grounded predicate, and each grounded predicate corresponds to a binary random variable, which will be used to define the MLN. For a $d$-ary predicate, there are $M^d$ ways to ground it, where $M$ is the number of entities. We denote an assignment as $a_r$; for instance, with $a_r = (c, c')$, we can simply write a grounded predicate $r(a_r)$ as $r(c, c')$. Each observed fact in a knowledge base is a truth value assigned to a grounded predicate, e.g., $[L(c, c') = 1]$. The number of observed facts is typically much smaller than that of unobserved facts. We adopt the open-world paradigm and treat the unobserved facts as latent variables.

For a clearer representation, we express a knowledge base as a bipartite graph $\mathcal{G}_{\mathcal{K}} = (\mathcal{C}, \mathcal{O}, \mathcal{E})$, where nodes on one side of the graph correspond to constants $c \in \mathcal{C}$ and nodes on the other side correspond to observed facts $o \in \mathcal{O}$, which are called factors in this case (Fig. 1). The set of edges $\mathcal{E}$ connects constants and observed facts. More specifically, an edge $e = (c, o, i)$ between node $c$ and node $o$ exists if the grounded predicate associated with $o$ uses $c$ in its $i$-th argument position. (See Fig. 1 for an illustration.)

Markov Logic Networks. MLNs use logic formulae to define potential functions in undirected graphical models. A logic formula $f$ is a binary function defined via the composition of a few predicates. For instance, a logic formula can be

$$\text{Smoke}(c) \land \text{Friend}(c, c') \Rightarrow \text{Smoke}(c') \iff \neg\text{Smoke}(c) \lor \neg\text{Friend}(c, c') \lor \text{Smoke}(c'),$$

where $\neg$ is negation and the equivalence is established by De Morgan's law. Similar to predicates, we denote an assignment of constants to the arguments of a formula $f$ as $a_f$, and the entire collection of consistent assignments of constants as $\mathcal{A}_f$. Given these logic representations, an MLN can be defined as a joint distribution over all observed facts $\mathcal{O}$ and unobserved facts $\mathcal{H}$ as (Fig. 1)

$$P(\mathcal{O}, \mathcal{H}) := \frac{1}{Z(w)} \exp\left( \sum_{f \in \mathcal{F}} w_f \sum_{a_f \in \mathcal{A}_f} \phi_f(a_f) \right), \tag{1}$$

where $Z(w)$ is the normalizing constant summing over all assignments to the grounded predicates, and $\phi_f(\cdot)$ is the potential function defined by a formula $f$. One form of $\phi_f(\cdot)$ can simply be the truth value of the logic formula $f$. For instance, if the formula is $f(c, c') := \neg\text{Smoke}(c) \lor \neg\text{Friend}(c, c') \lor \text{Smoke}(c')$, then $\phi_f(c, c')$ can simply take value $1$ when $f(c, c')$ is true and $0$ otherwise. Other, more sophisticated $\phi_f$ can also be designed, which have the potential to take into account complex entities such as images or texts, but they are not the focus of this paper. The weight $w_f$ can be viewed as the confidence score of formula $f$: the higher the weight, the more accurate the formula is.
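To make the potential concrete, here is a small self-contained sketch (toy code, not from the paper) that checks the De Morgan equivalence above by enumeration and evaluates the truth-value potential $\phi_f$ for the Smoke/Friend formula; the weight `w_f` is an illustrative value:

```python
import math
from itertools import product

def implication(smoke_c, friend, smoke_c2):
    # Smoke(c) ∧ Friend(c, c') ⇒ Smoke(c')
    return (not (smoke_c and friend)) or smoke_c2

def disjunction(smoke_c, friend, smoke_c2):
    # ¬Smoke(c) ∨ ¬Friend(c, c') ∨ Smoke(c')
    return (not smoke_c) or (not friend) or smoke_c2

# The two forms agree on every truth assignment (De Morgan's law).
assert all(implication(*a) == disjunction(*a)
           for a in product([False, True], repeat=3))

def phi(a):
    """Potential φ_f(a_f): 1 when the grounded formula is true, 0 otherwise."""
    return 1.0 if disjunction(*a) else 0.0

# Unnormalized MLN score contributed by this formula: exp(w_f · Σ_a φ_f(a)).
# Only the assignment (Smoke=1, Friend=1, Smoke'=0) violates the clause.
w_f = 1.5  # illustrative formula weight
score = math.exp(w_f * sum(phi(a) for a in product([False, True], repeat=3)))
```

Of the eight assignments, seven satisfy the clause, so the exponent is $w_f \cdot 7$.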

## 3 Challenges for Inference in Markov Logic Networks

Inference in Markov Logic Networks can be very computationally intensive, since the inference needs to be carried out in the fully grounded network involving all grounded variables and formula nodes. Most previous inference methods barely scale to hundreds of entities.

Mean field approximation. We will focus on mean field approximation, since it has been demonstrated to scale to many large graphical models, such as latent Dirichlet allocation for modeling topics in large text corpora [10, 11]. In this case, the conditional distribution $P(\mathcal{H} \mid \mathcal{O})$ is approximated by a product distribution $\prod_{r(a_r) \in \mathcal{H}} Q(r(a_r))$. The set of mean field distributions can be determined by KL-divergence minimization:

$$\{Q^*(r(a_r))\} = \operatorname*{argmin}_{\{Q(r(a_r))\}} \ \mathrm{KL}\left( \prod_{r(a_r) \in \mathcal{H}} Q(r(a_r)) \,\middle\|\, P(\mathcal{H} \mid \mathcal{O}) \right) \tag{2}$$

$$= \operatorname*{argmin}_{\{Q(r(a_r))\}} \ \sum_{r(a_r) \in \mathcal{H}} \mathbb{E}_Q\left[\ln Q(r(a_r))\right] - \sum_{f \in \mathcal{F}} w_f \sum_{a_f \in \mathcal{A}_f} \mathbb{E}_Q\left[\phi_f(a_f) \mid \mathcal{O}\right],$$

where $\mathbb{E}_Q[\cdot \mid \mathcal{O}]$ means that the observed predicates in $\phi_f(a_f)$ are fixed to their actual values with probability 1. For instance, a grounded formula $\neg\text{Smoke}(c) \lor \neg\text{Friend}(c, c') \lor \text{Smoke}(c')$ with observations $\text{Smoke}(c) = 1$ and $\text{Friend}(c, c') = 1$ reduces the expectation to one over the remaining latent predicate $\text{Smoke}(c')$. In theory, one can use mean field iteration to obtain the optimal $\{Q^*(r(a_r))\}$, but due to the large number of nodes in the grounded network, this iterative algorithm can be very inefficient.

Thus, we need to carefully think about how to parametrize the set of $\{Q(r(a_r))\}$ so that the parametrization is expressive enough to represent the posterior distributions, while at the same time leading to efficient algorithms. Some common choices are:

• Naive parametrization. Assign each $Q(r(a_r))$ its own parameter $\mu_{r(a_r)} \in [0, 1]$. Such a parametrization is very expressive, but the number of parameters is the same as the size of the grounded MLN.

• Tunable embedding parametrization. Assign each entity $c$ a vector embedding $\mu_c$, and define $Q$ via the involved entities. For instance, $Q(r(c, c')) = \text{logistic}(\text{MLP}_r(\mu_c, \mu_{c'}))$, where $\text{MLP}_r$ is a neural network specific to predicate $r$ and logistic is the standard logistic function. The number of parameters in such a scheme is linear in the number of entities, but a very high dimensional embedding may be needed to express the posteriors.

Note that both schemes are transductive: the learned $\mu_{r(a_r)}$ or $\mu_c$ can only be used on the training graph, and cannot be applied to new entities or to different but related knowledge graphs (the inductive setting).
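As an illustration of the tunable embedding scheme, the following sketch uses toy names, and a single random linear layer stands in for the predicate-specific MLP; it parametrizes $Q(r(c, c'))$ from the two entity embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
M, dim = 5, 8  # number of entities, embedding dimension

# One trainable vector per entity: O(M) parameters in total.
entity_emb = rng.normal(size=(M, dim))

# A per-predicate "MLP" reduced to a single linear layer for brevity.
W_r = rng.normal(size=2 * dim)
b_r = 0.0

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def Q(c, c_prime):
    """Mean field marginal Q(r(c, c')) for one predicate r."""
    h = np.concatenate([entity_emb[c], entity_emb[c_prime]])
    return logistic(W_r @ h + b_r)
```

Because the argument order changes the concatenation, `Q(0, 1)` and `Q(1, 0)` generally differ, matching the asymmetry of predicate arguments.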

Stochastic inference. The objective in Eq. (2) contains an expensive summation, making its evaluation and optimization inefficient. For instance, for a formula with two distinct arguments, the number of terms in the inner summation is quadratic in the number of entities. Thus, we approximate the objective function by stochastic sampling and optimize the parameters of $\{Q(r(a_r))\}$ via stochastic gradients; various strategies can be used to reduce the variance of the stochastic gradients [10, 11].
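A minimal numeric sketch of this stochastic approximation (toy marginals stored in flat tables; in the paper these would come from the parametrized $Q$): the full sum over all $M^2$ groundings of a two-argument formula is replaced by a scaled average over sampled groundings.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100    # number of entities
w_f = 1.0  # formula weight

# Toy mean field marginals q = Q(·=1) for Smoke(c) and Friend(c, c').
q_smoke = rng.uniform(0.1, 0.9, size=M)
q_friend = rng.uniform(0.1, 0.9, size=(M, M))

def expected_phi(c, c2):
    # E_Q[φ_f] for ¬Smoke(c) ∨ ¬Friend(c, c') ∨ Smoke(c') under independent
    # Bernoulli marginals: the clause is false only when Smoke(c)=1,
    # Friend(c, c')=1, and Smoke(c')=0.
    return 1.0 - q_smoke[c] * q_friend[c, c2] * (1.0 - q_smoke[c2])

# Unbiased estimate of w_f · Σ_{a_f} E_Q[φ_f(a_f)] from K sampled groundings
# instead of the full M·M sum.
K = 64
samples = rng.integers(0, M, size=(K, 2))
estimate = w_f * (M * M) * np.mean([expected_phi(c, c2) for c, c2 in samples])
```

The variance of such estimates can then be reduced with the strategies cited above.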

## 4 Graph Neural Network for Inference

To parametrize the entity embeddings with fewer parameters than tunable embeddings, we propose to use a GNN on the knowledge graph $\mathcal{G}_{\mathcal{K}}$, which is much smaller than the fully grounded MLN (Figure 1), to generate an embedding $\mu_c$ for each entity $c$, and then use these embeddings to define the mean field distributions. The advantage of GNNs is that the number of parameters can be independent of the number of entities: any entity embedding can be reproduced by running GNN iterations online. Thus, GNN-based parametrization can potentially be very memory efficient, making it possible to scale up to a large number of entities. Furthermore, the learned GNN parameters can be used in both transductive and inductive settings.

The architecture of the GNN over a knowledge graph is given in Algorithm 1, where edge-type-specific multilayer neural networks take the values of observed facts and the argument positions into account, and the aggregation operators are typically sum pooling functions. These embedding updates are carried out a finite number of times; in general, the more iterations are carried out, the larger the graph neighborhood that is integrated into a node's representation. For simplicity of notation, we use $\mu_c$ and $\mu_o$ to refer to the final embeddings of constants and observed facts. These embeddings are then used to define the mean field distributions. For instance, if a predicate $r$ involves two entities $c$ and $c'$, $Q(r(c, c'))$ can be defined as

$$Q(r(c, c')) = \text{logistic}\left(\text{MLP}_r(\mu_c, \mu_{c'})\right), \quad \text{where } \{\mu_c, \mu_o\} = \text{GNN}(\mathcal{G}_{\mathcal{K}}), \tag{3}$$

and $\text{MLP}_r$ are predicate-specific multilayer neural networks. For $k$-dimensional embeddings, the number of parameters in the GNN model is typically quadratic in $k$ and independent of the number of entities.
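The message-passing scheme can be sketched as follows (a toy stand-in for Algorithm 1: the knowledge base, the per-edge-type linear maps, and the final linear readout are all illustrative, not the paper's code). Constant and fact nodes exchange messages keyed by observed value and argument position, with sum pooling:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Toy knowledge graph: observed facts as (predicate, value, args) triples.
facts = [("L", 1, ("A", "B")), ("F", 1, ("A", "C"))]
constants = ["A", "B", "C"]

# Initial embeddings: constants share one node type; fact nodes start from
# a predicate-specific vector (their NodeType).
pred_vec = {"L": rng.normal(size=dim), "F": rng.normal(size=dim)}
mu = {c: np.ones(dim) for c in constants}
nu = {i: pred_vec[r] for i, (r, _, _) in enumerate(facts)}

# One linear map per (value, argument position) stands in for the
# edge-type-specific MLPs.
edge_mlp = {}
def msg(value, pos, h):
    key = (value, pos)
    if key not in edge_mlp:
        edge_mlp[key] = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    return edge_mlp[key] @ h

for _ in range(2):  # two rounds of message passing with sum pooling
    new_nu = {i: sum((msg(v, p, mu[c]) for p, c in enumerate(args)),
                     np.zeros(dim))
              for i, (_, v, args) in enumerate(facts)}
    new_mu = {c: sum((msg(v, p, nu[i])
                      for i, (_, v, args) in enumerate(facts)
                      for p, a in enumerate(args) if a == c),
                     np.zeros(dim))
              for c in constants}
    mu, nu = new_mu, new_nu

# The final embeddings define mean field marginals as in Eq. (3).
def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

w_r = rng.normal(size=2 * dim)  # stand-in for a predicate-specific MLP
q = logistic(w_r @ np.concatenate([mu["B"], mu["C"]]))
```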

## 5 Is GNN Expressive Enough?

Now a central question is: is GNN parametrization using the knowledge graph $\mathcal{G}_{\mathcal{K}}$ expressive enough? Or will there be situations where two latent random variables have different posterior distributions in the MLN, but their mean field distributions are forced to be the same because the underlying GNN embeddings coincide? The question arises because the knowledge graph $\mathcal{G}_{\mathcal{K}}$ is different from the fully grounded MLN (Figure 1). We will use two theorems (proofs are given in Appendix A) to analyze whether GNNs are expressive enough. Our main results can be summarized as:


• Without formulae tying together the predicates, GNN embeddings are expressive enough for feature representations of latent facts in the knowledge base.

• With formulae in MLN modeling the dependency between predicates, GNN embedding becomes insufficient for posterior parametrization.

These theoretical analyses motivate a more expressive variant, called ExpressGNN (Section 6), which compensates for the insufficient representation power of GNNs for posterior parametrization.

### 5.1 Property of GNN

Recent research shows that GNNs can learn to perform approximate graph isomorphism check [4, 12]. In our case, graph isomorphism of knowledge graphs is defined as follows.

###### Definition 5.1 (Graph Isomorphism).

An isomorphism of graphs $G$ and $H$ is a bijection between the node sets, $\pi : V(G) \to V(H)$, such that (1) $\text{NodeType}(v) = \text{NodeType}(\pi(v))$ for every node $v$; (2) $u$ and $v$ are adjacent in $G$ if and only if $\pi(u)$ and $\pi(v)$ are adjacent in $H$, with $\text{EdgeType}(u, v) = \text{EdgeType}(\pi(u), \pi(v))$.

In this definition, the NodeType of a fact node is determined by its associated predicate, and the EdgeType is determined by the observed value and the argument position. Our GNN architecture in Algorithm 1 is adapted to these graph topology specifications: the initial node embeddings correspond to NodeTypes, and different MLPs are used for different EdgeTypes.

It has been proved that GNNs with injective aggregation functions acting on neighborhood features can be as powerful as the 1-dimensional Weisfeiler-Lehman graph isomorphism test [13, 12], also known as color refinement. The color refinement procedure is shown in red in Algorithm 1 and is analogous to the updates of the GNN embeddings. We use it to define indistinguishable nodes.

###### Definition 5.2 (Indistinguishable Nodes).

Two nodes in a graph are indistinguishable if they have the same color after the color refinement procedure terminates and no further refinement of the node color classes is possible. In this paper, we use the following notation:

$$(c_1, \cdots, c_n) \overset{G}{\longleftrightarrow} (c'_1, \cdots, c'_n): \text{ for each } i \in [n],\ c_i \text{ and } c'_i \text{ are indistinguishable in } G.$$

While GNNs have the ability to perform color refinement, they are at most as powerful as color refinement [12]. Therefore, if two tuples of constants are indistinguishable under color refinement, their final GNN embeddings will be the same. When we use the GNN embeddings to define $Q$ as in Eq. (3), this implies that the corresponding mean field distributions are forced to be identical.
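Color refinement itself is easy to sketch. The following toy implementation (a generic typed graph, not the paper's Algorithm 1) iterates exactly the update described above and stops when no color class splits further:

```python
def color_refinement(adj, init_colors):
    """1-dimensional Weisfeiler-Lehman (color refinement) on a typed graph.

    adj: dict node -> list of (edge_type, neighbor).
    init_colors: dict node -> hashable initial color (the NodeType).
    Returns a stable coloring as a dict node -> int."""
    colors = dict(init_colors)
    while True:
        # New signature = own color + multiset of (edge type, neighbor color).
        sig = {v: (colors[v],
                   tuple(sorted((t, colors[u]) for t, u in adj[v])))
               for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values()),
                                                     key=repr))}
        new_colors = {v: palette[sig[v]] for v in adj}
        # Stop once no color class splits any further.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors

# Example: on a path a—b—c the two endpoints are indistinguishable from
# each other but distinguishable from the middle node.
path = {"a": [("e", "b")], "b": [("e", "a"), ("e", "c")], "c": [("e", "b")]}
out = color_refinement(path, {v: 0 for v in path})
```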

### 5.2 GNN is expressive for feature representations in knowledge bases

GNN embeddings are computed from the knowledge base, which involves only the observed facts $\mathcal{O}$, but the resulting entity embeddings will be used to represent a much larger set of unobserved facts $\mathcal{H}$. In this section, we show that, when formulae are not considered, these entity embeddings are expressive enough to represent the latent facts in $\mathcal{H}$.

To better explain our idea, we define a fully grounded knowledge base as one in which all unobserved facts are included and assigned the value "?". The facts in this fully grounded knowledge base can therefore take one of three different values, i.e., $\{0, 1, ?\}$. Its corresponding augmented factor graph is denoted $\bar{\mathcal{G}}_{\mathcal{K}}$. See Fig. 2 for an illustration of $\mathcal{G}_{\mathcal{K}}$ and $\bar{\mathcal{G}}_{\mathcal{K}}$.

###### Theorem 5.1.

Let $\mathcal{G}_{\mathcal{K}}$ be the factor graph for a knowledge base $\mathcal{K}$ and $\bar{\mathcal{G}}_{\mathcal{K}}$ be the corresponding augmented version. Then the following two statements are true:

• $c \overset{\mathcal{G}_{\mathcal{K}}}{\longleftrightarrow} c'$ if and only if $c \overset{\bar{\mathcal{G}}_{\mathcal{K}}}{\longleftrightarrow} c'$;

• $(c_1, \cdots, c_n) \overset{\mathcal{G}_{\mathcal{K}}}{\longleftrightarrow} (c'_1, \cdots, c'_n)$ if and only if $(c_1, \cdots, c_n) \overset{\bar{\mathcal{G}}_{\mathcal{K}}}{\longleftrightarrow} (c'_1, \cdots, c'_n)$.

Intuitively, Theorem 5.1 means that, without considering the presence of formulae in the MLN, unobserved predicates can be represented purely from GNN embeddings computed on $\mathcal{G}_{\mathcal{K}}$. For instance, to represent an unobserved predicate $r(c, c')$, we only need to compute the embeddings $\mu_c$ and $\mu_{c'}$ on $\mathcal{G}_{\mathcal{K}}$ and use $(\mu_c, \mu_{c'})$ as its feature. This feature representation is as expressive as one obtained from the much larger graph $\bar{\mathcal{G}}_{\mathcal{K}}$, drastically reducing the computation.

### 5.3 GNN is not expressive enough for posterior parametrization

Taking into account the influence of formulae on the posterior distributions of predicates, we can show that GNN embeddings alone become insufficient representations for parameterizing these posteriors. To better explain our analysis in this section, we first extend the definition of graph isomorphism to node isomorphism and then state the theorem.

###### Definition 5.3 (Isomorphic Nodes).

Two ordered sequences of nodes $(c_1, \cdots, c_n)$ and $(c'_1, \cdots, c'_n)$ are isomorphic in a graph $G$ if there exists an isomorphism from $G$ to itself, i.e., $\pi : V(G) \to V(G)$, such that $\pi(c_i) = c'_i$ for each $i \in [n]$. Further, we use the following notation:

$$(c_1, \cdots, c_n) \overset{\mathcal{G}_{\mathcal{K}}}{\Longleftrightarrow} (c'_1, \cdots, c'_n): \ (c_1, \cdots, c_n) \text{ and } (c'_1, \cdots, c'_n) \text{ are isomorphic in } \mathcal{G}_{\mathcal{K}}.$$
###### Theorem 5.2.

Consider a knowledge base $\mathcal{K}$ and any $r(c_1, \cdots, c_n), r(c'_1, \cdots, c'_n) \in \mathcal{H}$. The two latent random variables $r(c_1, \cdots, c_n)$ and $r(c'_1, \cdots, c'_n)$ have the same posterior distribution in any MLN if and only if $(c_1, \cdots, c_n) \overset{\mathcal{G}_{\mathcal{K}}}{\Longleftrightarrow} (c'_1, \cdots, c'_n)$.

###### Remark.

We say two random variables $X$ and $X'$ have the same posterior if the marginal distributions $P(X \mid \mathcal{O})$ and $P(X' \mid \mathcal{O})$ are the same and, moreover, for any sequence of random variables $Y_1, \cdots, Y_m$ in $\mathcal{H}$, there exists a sequence $Y'_1, \cdots, Y'_m$ in $\mathcal{H}$ such that the marginal distributions $P(X, Y_1, \cdots, Y_m \mid \mathcal{O})$ and $P(X', Y'_1, \cdots, Y'_m \mid \mathcal{O})$ are the same.

A proof is given in Appendix A. The proof of the necessary condition essentially shows that, if two tuples of constants are NOT isomorphic in $\mathcal{G}_{\mathcal{K}}$, we can always define a formula which makes the corresponding grounded predicates distinguishable in the MLN and gives them different posterior distributions. This implies an important fact: to obtain an expressive representation for the posterior,


• either GNN embeddings need to be powerful enough to distinguish non-isomorphic nodes;

• or the information of the formulae / MLN needs to be incorporated into the parametrization.

The second condition somewhat defeats our purpose of using mean field approximation to speed up inference, so it is not considered here. Unfortunately, the first condition is also not satisfied, because existing GNNs are at most as powerful as color refinement, which is not an exact graph isomorphism test. Moreover, the node isomorphism in Theorem 5.2 is even harder than graph isomorphism because it is a constrained graph isomorphism.

We will interpret the implications of this theorem by an example. Figure 3 shows a factor graph representation of a knowledge base, which leads to the following observations:

Even though $A$ and $B$ have opposite relations with $E$ in the knowledge base, $A$ and $B$ are indistinguishable in $\mathcal{G}_{\mathcal{K}}$ and thus have the same GNN embeddings, i.e., $\mu_A = \mu_B$.

Suppose the MLN contains a single formula relating the predicates L and F (L for Like, F for Friend). Then $L(A, E)$ and $L(B, E)$ apparently have different posteriors. However, using GNN embeddings alone, $Q(L(A, E))$ is always identical to $Q(L(B, E))$, since $\mu_A = \mu_B$.

There exists an isomorphism $\pi$ of $\mathcal{G}_{\mathcal{K}}$ such that $\pi(A) = B$, but no isomorphism satisfies both $\pi(A) = B$ and $\pi(E) = E$. Therefore, we see how the isomorphism constraints in Theorem 5.2 make the problem even harder than a plain graph isomorphism check.

To conclude, node embeddings produced by a GNN alone are not expressive enough to represent the posteriors in an MLN. We provide more examples in Appendix B to show that this situation is common rather than rare. In the next section, we introduce a way of correcting the node embeddings.

## 6 ExpressGNN: More Expressive GNN with Tunable Embeddings

It is currently challenging to design a new GNN that can check node isomorphism, because no polynomial-time algorithm is known even for the unconstrained graph isomorphism test [14, 15]. In this section, we propose a simple yet effective solution. Take Figure 3 as an example:

To make $Q(L(A, E))$ different from $Q(L(B, E))$, we can simply introduce additional low-dimensional tunable embeddings $\omega_A$, $\omega_B$, and $\omega_E$, and correct the parametrization as

$$\text{logistic}\left(\text{MLP}_L([\mu_A, \omega_A], [\mu_E, \omega_E])\right) \quad \text{and} \quad \text{logistic}\left(\text{MLP}_L([\mu_B, \omega_B], [\mu_E, \omega_E])\right).$$

With tunable $\omega_A$ and $\omega_B$, $Q(L(A, E))$ and $Q(L(B, E))$ can take different values. In general, we can assign each entity $c$ a low-dimensional tunable embedding $\omega_c$ and concatenate it with the GNN embedding $\mu_c$ to represent the entity. We call this variant ExpressGNN and describe the parametrization of $Q$ in Algorithm 2.
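The correction can be sketched numerically (toy values: identical vectors stand in for the GNN embeddings of the indistinguishable $A$ and $B$, and random linear layers stand in for $\text{MLP}_L$):

```python
import numpy as np

rng = np.random.default_rng(0)
gnn_dim, tune_dim = 8, 2

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# GNN embeddings: A and B are indistinguishable in G_K, so the GNN must
# assign them identical vectors (values here are illustrative).
mu = {"A": np.ones(gnn_dim), "B": np.ones(gnn_dim),
      "E": rng.normal(size=gnn_dim)}

# Low-dimensional tunable embeddings, one per entity, break the tie.
omega = {c: rng.normal(size=tune_dim) for c in mu}

# Single linear layers stand in for MLP_L; small weights keep the
# logistic away from saturation.
w_gnn = rng.normal(size=2 * gnn_dim) * 0.1
w_both = rng.normal(size=2 * (gnn_dim + tune_dim)) * 0.1

def Q_gnn_only(c, c2):
    return logistic(w_gnn @ np.concatenate([mu[c], mu[c2]]))

def Q_express(c, c2):
    h = np.concatenate([mu[c], omega[c], mu[c2], omega[c2]])
    return logistic(w_both @ h)
```

Here `Q_gnn_only` is forced to return the same value for $(A, E)$ and $(B, E)$, while `Q_express` can separate them.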

One can think of ExpressGNN as a hierarchical encoding of entities: the GNN embeddings assign similar codes to nodes with similar knowledge graph neighborhoods, while the tunable embeddings provide additional capacity to encode variations beyond the knowledge graph structure. The hope is that only a very low dimensional tunable embedding is needed to fine-tune individual differences, so that the total number of parameters in ExpressGNN can be much smaller than using tunable embeddings alone.

ExpressGNN also presents an interesting trade-off between induction and transduction ability. The GNN embedding part allows ExpressGNN to possess some generalization ability to new entities and different knowledge graphs; while the tunable embedding part gives ExpressGNN the extra representation power to perform accurate inference in the current knowledge graph.

## 7 Experiments

Our experiments show that mean field approximation with ExpressGNN enables efficient and effective probabilistic logic inference, and leads to state-of-the-art results on several benchmark and large-scale datasets.

Benchmark datasets. (i) UW-CSE contains information about students and professors in five departments (AI, Graphics, Language, System, Theory) [1]. (ii) Cora [16] contains a collection of citations to computer science research papers; it is split into five subsets according to research field. (iii) The synthetic Kinship datasets contain kinship relationships (e.g., Father, Brother) and resemble the popular Kinship dataset [17]. (iv) FB15K-237 is a large-scale knowledge base [18]. Statistics of the datasets are provided in Table 1. See more details in Appendix C.

### 7.1 Ablation study and comparison to strong MLN inference methods

We conduct experiments on Kinship, UW-CSE, and Cora, since the other baselines can only scale up to these datasets. We use the original logic formulae provided with UW-CSE and Cora, and use hand-coded rules for Kinship. The weights of all formulae are set to 1. We use the area under the precision-recall curve (AUC-PR) to evaluate deductive inference accuracy for predicates never seen during training, under the open-world setting (in Appendix E, we report the performance under the closed-world setting as in the original works). See Appendix C for more details. Before comparing to the other baselines, we first perform an ablation study for ExpressGNN on Cora to explore the trade-off between the GNN and tunable embeddings.

Ablation study. The number of parameters in GNN is independent of entity size, but it is less expressive. The number of parameters in the tunable component is linear in entity size, but it is more expressive. Results on different combinations of these two components are shown in Table 2, which are consistent with our analytical result: GNN alone is not expressive enough.

It is observed that GNN64+Tune4 has performance comparable to Tune64 and is consistently better than GNN64. However, GNN64+Tune4 requires far fewer parameters than Tune64, since its tunable component is only 4-dimensional. A similar result is observed for GNN64+Tune64 and Tune128. Therefore, ExpressGNN, as a combination of the two types of embeddings, can enjoy the advantages of both: a small number of parameters and high expressiveness. We will use ExpressGNN throughout the rest of the experiments, with hyperparameters optimized on the validation set. See Appendix C for details.

Inference accuracy. We evaluate the inference accuracy of ExpressGNN against a number of state-of-the-art MLN inference algorithms: (i) MCMC (Gibbs Sampling) [19, 1]; (ii) Belief Propagation (BP) [20]; (iii) Lifted Belief Propagation (Lifted BP) [8]; (iv) MC-SAT [21]; (v) Hinge-Loss Markov Random Field (HL-MRF) [22]. Results are shown in Table 3.

A hyphen in an entry indicates that inference either runs out of memory or exceeds the time limit (24 hours). Note that since lifted BP is guaranteed to produce results identical to BP [8], the results of these two methods are merged into one row. For UW-CSE, the results show that ExpressGNN consistently outperforms all baselines. On synthetic Kinship, since the dataset is noise-free, HL-MRF achieves a score of 1 on the first four sets. ExpressGNN yields similar but not perfect scores on all the subsets, presumably due to the stochastic nature of our sampling and optimization method.

Inference efficiency. The inference times on UW-CSE and Kinship are summarized in Figure 4 (Cora is omitted as none of the baselines is feasible on it). As the size of the dataset grows linearly, the inference time of all baseline methods grows exponentially, whereas ExpressGNN maintains a nearly constant inference time, demonstrating strong scalability. HL-MRF, while maintaining a comparatively short wall-clock time, exhibits an exponential increase in space complexity. Slower methods such as MCMC and BP become infeasible for large datasets. Overall, ExpressGNN is faster than all baseline methods by at least one to two orders of magnitude.

### 7.2 Large-scale knowledge base completion

We use the large-scale dataset FB15K-237 [18] to show the scalability of ExpressGNN. Since none of the aforementioned probabilistic inference methods are tractable on this dataset, we compare with several state-of-the-art supervised methods for knowledge base completion: (i) Neural Logic Programming (Neural LP) [23]; (ii) Neural Tensor Network (NTN) [24]; (iii) TransE [25]. In these knowledge base completion experiments, we follow the setting in [10] and add a discriminative loss to better utilize the observed data. We use Neural LP to generate candidate rules and pick those with high confidence scores for ExpressGNN. See Appendix F for examples of the logic formulae used in the experiments. For the competitor methods, we use default tuned hyperparameters, which reproduce the experimental results reported in their original works.

Evaluation. Given a query with one argument unknown, the task is to rank the true answer above all other possible groundings. For evaluation, we compute the Mean Reciprocal Rank (MRR), which is the average reciprocal rank over all true queries, and Hits@10, which is the percentage of true queries ranked among the top 10. Following the protocol proposed in [23, 25], we use filtered rankings.
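Both metrics are simple to compute from the (filtered) 1-based rank of each true answer; a minimal sketch with made-up ranks:

```python
def mrr_and_hits(ranks, k=10):
    """Mean Reciprocal Rank and Hits@k from 1-based filtered ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

# Toy filtered ranks for four true queries (illustrative numbers only).
mrr, hits10 = mrr_and_hits([1, 2, 5, 40])
# mrr = (1 + 1/2 + 1/5 + 1/40) / 4 = 0.43125 ; hits10 = 3/4
```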

Data efficiency in the transductive setting. We demonstrate the data efficiency of using logic formulae by comparing ExpressGNN with the aforementioned supervised approaches. More precisely, we follow [23] to split the knowledge base into facts / training / validation / testing sets, vary the size of the training set from 0% to 100%, and feed each varied training set, together with the same complete facts set, to the models for training. Evaluations on the testing set are given in Table 5. With small training data, ExpressGNN generalizes significantly better than the supervised methods; with more supervision, the supervised approaches start to close the gap. This also suggests that high-confidence logic rules indeed help generalization under small training data.

Inductive ability. To demonstrate the inductive learning ability of ExpressGNN, we conduct experiments on FB15K-237 where training and testing use disjoint sets of both entities and relations. To prepare data for this setting, we first randomly select a subset of relations and restrict the test set to relations in this subset, similar to [25]. Table 5 shows the results. As expected, in this inductive setting, supervised transductive learning methods such as NTN and TransE drop to nearly zero in terms of MRR and Hits@10 (both metrics are smaller than 0.01 for NTN and TransE). Neural LP performs inductive learning and generalizes well to new entities in the test set, as discussed in [23]; however, in our inductive setting, where all the relations in the test set are new, Neural LP is not able to achieve good performance either, as reported in Table 5. In contrast, ExpressGNN can directly exploit first-order logic and is much less affected by the new relations, achieving reasonable performance at the same scale as in the non-inductive setting.

## 8 Conclusion

Our analysis shows that GNNs, while well suited as a component for probabilistic logic inference in MLN, are not expressive enough on their own. Motivated by this analysis, we propose ExpressGNN, an integrated GNN and tunable embedding approach, which offers a trade-off between model size and expressiveness, and leads to scalable and effective logic inference in both transductive and inductive experiments. ExpressGNN opens up many possibilities for future research, such as formula weight learning with variational inference, incorporating entity features and neural tensorized logic formulae, and addressing challenging datasets such as GQA [26].

## References

• Richardson and Domingos [2006] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006.
• Bruna et al. [2014] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.
• Duvenaud et al. [2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
• Dai et al. [2016] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, pages 2702–2711, 2016.
• Li et al. [2016] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In ICLR, 2016.
• Kipf and Welling [2017] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
• Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
• Singla and Domingos [2008] Parag Singla and Pedro M Domingos. Lifted first-order belief propagation. In AAAI, volume 8, pages 1094–1099, 2008.
• Singla et al. [2014] Parag Singla, Aniruddh Nath, and Pedro M Domingos. Approximate lifting techniques for belief propagation. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
• Meng et al. [2019] Qu Meng, Bengio Yoshua, and Tang Jian. Gmnn: Graph markov neural networks. arXiv preprint arXiv:1905.06214, 2019.
• Hoffman et al. [2013] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
• Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
• Shervashidze et al. [2011] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
• Garey and Johnson [2002] Michael R Garey and David S Johnson. Computers and intractability, volume 29. wh freeman New York, 2002.
• Babai [2016] László Babai. Graph isomorphism in quasipolynomial time. In

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing

, pages 684–697. ACM, 2016.
• Singla and Domingos [2005] Parag Singla and Pedro Domingos. Discriminative training of markov logic networks. In AAAI, volume 5, pages 868–873, 2005.
• Denham [1973] Woodrow W Denham. The detection of patterns in Alyawara nonverbal behavior. PhD thesis, University of Washington, Seattle., 1973.
• Toutanova and Chen [2015] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.
• Gilks et al. [1995] Walter R Gilks, Sylvia Richardson, and David Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall/CRC, 1995.
• Yedidia et al. [2001] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Generalized belief propagation. In Advances in neural information processing systems, pages 689–695, 2001.
• Poon and Domingos [2006] Hoifung Poon and Pedro Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In AAAI, volume 6, pages 458–463, 2006.
• Bach et al. [2015] Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss markov random fields and probabilistic soft logic. arXiv preprint arXiv:1505.04406, 2015.
• Yang et al. [2017] Fan Yang, Zhilin Yang, and William W Cohen. Differentiable learning of logical rules for knowledge base completion. CoRR, abs/1702.08367, 2017.
• Socher et al. [2013] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934, 2013.
• Bordes et al. [2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
• Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: a new dataset for compositional question answering over real-world images. arXiv preprint arXiv:1902.09506, 2019.
• Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

## Appendix A Proof of Theorems

See Theorem 5.1.

###### Proof.

For simplicity, in this proof we write $G$ and $\bar{G}$ for the original factor graph of the knowledge base and for the graph augmented with all unobserved grounded predicates, respectively.

• Let us first assume statement 1 is true and prove statement 2.

The neighbors of $H := r(c_1,\dots,c_n)$ and $H' := r(c'_1,\dots,c'_n)$ are

$$\mathcal{N}(H)=\{(c_i,i): i=1,\dots,n\} \quad\text{and}\quad \mathcal{N}(H')=\{(c'_i,i): i=1,\dots,n\}, \qquad (4)$$

where the index $i$ in a pair $(c_i,i)$ represents the edge type (the argument position). It is easy to see that $H$ and $H'$ are indistinguishable in $\bar{G}$ if and only if $c_i$ and $c'_i$ are indistinguishable in $\bar{G}$ for every $i=1,\dots,n$. By statement 1, $c_i$ and $c'_i$ are indistinguishable in $\bar{G}$ if and only if $c_i$ and $c'_i$ are indistinguishable in $G$. Hence, statement 2 is true. Now it remains to prove statement 1.

• ($\Leftarrow$) If $c$ and $c'$ are distinguishable in $G$, it is easy to see that $c$ and $c'$ are also distinguishable in the new graph $\bar{G}$. The reason is that the newly added nodes are of a different type from the observed nodes in $G$, so these newly added nodes cannot make two distinguishable nodes become indistinguishable.

($\Rightarrow$) Assume that $c$ and $c'$ are indistinguishable in $G$; we will prove that they are indistinguishable in $\bar{G}$ by mathematical induction (MI). The idea is to construct the new graph $\bar{G}$ by connecting the unobserved nodes to $G$ in a particular order. More specifically, we first connect all unobserved grounded predicates to their first arguments, and denote the resulting graph by $G_1$. Then we connect all unobserved grounded predicates to their second arguments and denote the resulting graph by $G_2$. In this way, we obtain a sequence of graphs $G=G_0, G_1, \dots, G_T=\bar{G}$, where $T$ is the maximal number of arguments of a predicate. In the following, we use MI to prove that for all $k$, if two constants are indistinguishable in $G$, then they are indistinguishable in $G_k$.

Proof of (MI 1):

Consider any predicate $r$. For any two indistinguishable nodes $c$ and $c'$ in $G$, the numbers of observed grounded predicates of the form $r(c,\dots)$ and $r(c',\dots)$ are the same, while the total number of groundings with a fixed first argument is identical for every constant. Hence, it is obvious that

$$\#\{r(c,\dots): \text{unobserved}\} = \#\{r(c',\dots): \text{unobserved}\} = M. \qquad (5)$$

Before being connected to the graph $G$, the unobserved nodes are all indistinguishable from one another, because they are of the same node type. Now we connect each of these unobserved nodes to its first argument. Then $c$ is connected to the $M$ unobserved nodes $\{r(c,\dots)\}$ and $c'$ is connected to the $M$ unobserved nodes $\{r(c',\dots)\}$. Since both $c$ and $c'$ are connected to $M$ unobserved nodes, and these nodes are indistinguishable, $c$ and $c'$ remain indistinguishable. Also, after being connected to its first argument, $r(c,\dots)$ and $r(c',\dots)$ are indistinguishable if and only if $c$ and $c'$ are indistinguishable, which is obvious.

Similarly, for every other predicate $r^*$, we can connect all of its unobserved grounded predicates to their first arguments. In the resulting graph $G_1$, two nodes are indistinguishable if they are indistinguishable in $G$.
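As a toy instance of the counting identity in Eq. (5) (the constants and predicate below are hypothetical, chosen only for illustration): take constants $\{a,b,c\}$, a single binary predicate $r$, and observed facts $r(a,b)$ and $r(c,b)$, so that $a$ and $c$ are indistinguishable in $G$. Then

```latex
% counting the unobserved groundings with fixed first argument:
\#\{r(a,\cdot):\text{unobserved}\} = \#\{r(a,a),\, r(a,c)\} = 2, \\
\#\{r(c,\cdot):\text{unobserved}\} = \#\{r(c,a),\, r(c,c)\} = 2 = M .
```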

Assumption (MI $k$):

Assume that after connecting all unobserved grounded predicates to their first $k$ arguments, the constant nodes in the resulting graph, $G_k$, are indistinguishable if they are indistinguishable in $G$.

Proof of (MI $k+1$):

The constants in $G_k$ can be partitioned into groups $C^{(1)},\dots,C^{(N)}$, where the constants in the same group are indistinguishable in $G_k$ (and also indistinguishable in $G$).

Consider a predicate $r^*$. The set of unobserved grounded predicates whose $(k+1)$-th argument is $c$ can be written as

$$\{r^*(\dots, c_{k+1}=c, \dots): \text{unobserved}\} \qquad (6)$$
$$= \bigcup_{i_1=1}^{N}\cdots\bigcup_{i_k=1}^{N} \{r^*(c_1,\dots,c_k,c_{k+1}=c,\dots): c_1\in C^{(i_1)},\dots,c_k\in C^{(i_k)},\ \text{unobserved}\}. \qquad (7)$$

Similar to the arguments in (MI 1), for any two indistinguishable nodes $c$ and $c'$ and any fixed sequence of groups $(i_1,\dots,i_k)$, the sizes of the following two sets are the same:

$$M(i_1,\dots,i_k) = \#\{r^*(c_1,\dots,c_k,c_{k+1}=c,\dots): c_1\in C^{(i_1)},\dots,c_k\in C^{(i_k)},\ \text{unobserved}\} \qquad (8)$$
$$= \#\{r^*(c_1,\dots,c_k,c_{k+1}=c',\dots): c_1\in C^{(i_1)},\dots,c_k\in C^{(i_k)},\ \text{unobserved}\}. \qquad (9)$$

Also, all grounded predicates in the above two sets are indistinguishable in $G_k$, because their first $k$ arguments are indistinguishable. Hence, these are two sets of $M(i_1,\dots,i_k)$ many indistinguishable nodes. In conclusion, the two sets in Eq. (8) and Eq. (9) are indistinguishable in $G_k$ if $c$ and $c'$ are indistinguishable.

Now we connect all unobserved grounded predicates $r^*(\dots)$ to their $(k+1)$-th arguments. Then the constant node $c$ is connected to the set in Eq. (8) and $c'$ is connected to the set in Eq. (9). Since these two sets are indistinguishable, $c$ and $c'$ remain indistinguishable.

Similarly, for the other predicates, we can connect all unobserved grounded predicates to their $(k+1)$-th arguments. In the resulting graph, $G_{k+1}$, any pair of nodes remains indistinguishable if they are indistinguishable in $G_k$. This completes the induction, and hence statement 1 holds.
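The induction above can be checked numerically with a small sketch. The knowledge base below is hypothetical, and the code approximates "indistinguishability" by 1-dimensional Weisfeiler-Lehman refinement with typed edges rather than by the paper's implementation: augmenting the graph with all unobserved groundings should preserve which constants are indistinguishable.

```python
from collections import Counter
from itertools import product

def wl_colors(nodes, edges, init, rounds=4):
    """1-dimensional Weisfeiler-Lehman refinement with typed edges.
    edges[v] is a list of (neighbor, edge_type) pairs; init[v] is the
    initial color (node type).  Nodes that end with the same color are
    treated as 'indistinguishable'."""
    color = dict(init)
    for _ in range(rounds):
        sig = {v: (color[v], tuple(sorted(Counter(
                   (color[u], t) for u, t in edges.get(v, ())).items())))
               for v in nodes}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values()), key=repr))}
        color = {v: palette[sig[v]] for v in nodes}
    return color

def kb_graph(constants, observed, augment):
    """Factor graph of a toy knowledge base with one binary predicate r.
    A grounding ('r', a, b) links to its arguments with edge types 1 and 2.
    With augment=True, all unobserved groundings are added as latent
    nodes, which get a different node type than observed fact nodes."""
    groundings = set(observed)
    if augment:
        groundings |= {('r', a, b) for a, b in product(constants, repeat=2)}
    nodes = list(constants) + list(groundings)
    init = {c: 0 for c in constants}
    edges = {}
    for g in groundings:
        _, a, b = g
        init[g] = 1 if g in observed else 2   # observed vs latent node type
        edges.setdefault(g, []).extend([(a, 1), (b, 2)])
        edges.setdefault(a, []).append((g, 1))
        edges.setdefault(b, []).append((g, 2))
    return wl_colors(nodes, edges, init)

# Hypothetical knowledge base: r(A,B) and r(C,B) are observed, so the
# constants A and C are indistinguishable while B is not.
consts = ['A', 'B', 'C']
obs = {('r', 'A', 'B'), ('r', 'C', 'B')}
plain = kb_graph(consts, obs, augment=False)
full = kb_graph(consts, obs, augment=True)
print(plain['A'] == plain['C'], full['A'] == full['C'])  # True True
```

Marking the latent nodes with their own node type is exactly the condition used in the ($\Leftarrow$) direction: the added nodes can never merge two previously distinguishable constants.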

See Theorem 5.2.

###### Proof.

A graph isomorphism from $G$ to itself is called an automorphism, so in this proof we will use the term "automorphism" to indicate such a self-bijection.

We first prove the sufficient condition:

If there exists an automorphism $\pi$ on the graph $G$ such that $\pi(a)=a'$, then for any predicate $r$, $r(a)$ and $r(a')$ have the same posterior in any MLN.

An MLN is a graphical model that can also be represented by a factor graph, in which grounded predicates (random variables) and grounded formulae (potentials) are connected. We will show that there exists an automorphism $\phi$ on this MLN factor graph such that $\phi(r(a))=r(a')$; the sufficient condition then follows. This automorphism is easy to construct from the automorphism $\pi$ on $G$. More precisely, we define $\phi$ as

$$\phi(r(a_r))=r(\pi(a_r)), \qquad \phi(f(a_f))=f(\pi(a_f)), \qquad (10)$$

for any predicate $r$ and any assignment $a_r$ of its arguments, and any formula $f$ and any assignment $a_f$ of its arguments. It is easy to see that $\phi$ is an automorphism:

1. Since $\pi$ is a bijection, $\phi$ is also a bijection.

2. The definition above preserves the binding of the arguments: $r(a_r)$ and $f(a_f)$ are connected if and only if $r(\pi(a_r))$ and $f(\pi(a_f))$ are connected.

3. Given the definition of $\pi$, we know that $r(a_r)$ and $r(\pi(a_r))$ have the same observed value. Therefore, in the MLN, the corresponding potentials take the same values.

This completes the proof of the sufficient condition.
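The construction of $\phi$ from $\pi$ can be illustrated with a short sketch. The knowledge base, the formula, and all names below are hypothetical: a permutation of constants that maps the observed facts onto themselves induces a map on grounded predicates that preserves both the factor-graph adjacency and the observed values, which is what the three properties above require.

```python
from itertools import product

# Hypothetical knowledge base: three constants and one binary predicate
# r, with observed facts r(A,B) and r(C,B).
consts = ['A', 'B', 'C']
observed = {('A', 'B'), ('C', 'B')}

def is_kb_automorphism(pi):
    """pi (a dict) is a KB automorphism if it is a bijection on the
    constants that maps the observed-fact set onto itself."""
    return (sorted(pi.values()) == sorted(consts) and
            {(pi[a], pi[b]) for a, b in observed} == observed)

pi = {'A': 'C', 'B': 'B', 'C': 'A'}  # swaps A and C
assert is_kb_automorphism(pi)

# phi renames the arguments of every grounding: phi(r(a,b)) = r(pi(a), pi(b)).
def phi(atom):
    a, b = atom
    return (pi[a], pi[b])

# Take a hypothetical formula f(x,y) = r(x,y) => r(y,x); its grounding
# f(x,y) is connected to the atoms r(x,y) and r(y,x) in the MLN factor
# graph.  phi maps the neighborhood of f(x,y) onto the neighborhood of
# f(pi(x), pi(y)) and preserves every observed value:
for x, y in product(consts, repeat=2):
    neighbors = {(x, y), (y, x)}
    assert {phi(a) for a in neighbors} == {(pi[x], pi[y]), (pi[y], pi[x])}
    for a in neighbors:
        assert (a in observed) == (phi(a) in observed)
print("phi preserves adjacency and observations")
```

Since the two groundings $r(a)$ and $r(a')$ sit in identical positions of the factor graph, any inference procedure assigns them the same posterior.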

To prove the necessary condition, it is equivalent to show that, given the following assumption:

(A 1): there is no automorphism $\pi$ on the graph $G$ such that $\pi(a)=a'$,

the following statement is true:

there must exist an MLN and a predicate $r$ in it such that $r(a)$ and $r(a')$ have different posteriors.

Before showing this, let us first introduce the factor graph representation of a single logic formula $f$.

A logic formula $f$ can be represented as a factor graph, $G_f=(C_f, R_f, E_f)$, where the nodes on one side of the graph are the set $C_f$ of distinct constants needed in the formula, and the nodes on the other side are the set $R_f$ of predicates used to define the formula. The set of edges, $E_f$, connects constants to predicates or predicate negations. That is, an edge between a constant node $c$ and a predicate $r$ exists if the predicate uses $c$ as its $i$-th argument. We note that the distinct constants used in the definition of a logic formula are templates, from which actual constants can be instantiated. An illustration of a logic formula's factor graph can be found in Figure 5. Similar to the factor graph for the knowledge base, we also differentiate the types of edges by the position of the argument.

Figure 5: An example of the factor graph for a logic formula.
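The factor-graph representation just described can be sketched as follows; this is a minimal illustration with a hypothetical formula, and the helper `formula_factor_graph` is not from the paper.

```python
def formula_factor_graph(literals):
    """Build the factor graph G_f of a single formula.
    literals: list of (predicate, negated, argument_tuple).
    Nodes are the distinct constants (templates) and the literals; an
    edge (c, i, k) links constant c to literal i via its k-th argument,
    so the argument position acts as the edge type."""
    constants = sorted({c for _, _, args in literals for c in args})
    edges = [(c, i, k)
             for i, (_, _, args) in enumerate(literals)
             for k, c in enumerate(args, start=1)]
    return constants, edges

# Hypothetical formula: Smoke(x) AND Friend(x, y) => Smoke(y),
# written in clausal form: NOT Smoke(x) OR NOT Friend(x, y) OR Smoke(y).
clause = [('Smoke', True, ('x',)),
          ('Friend', True, ('x', 'y')),
          ('Smoke', False, ('y',))]
consts, edges = formula_factor_graph(clause)
print(consts)  # ['x', 'y']
print(edges)   # [('x', 0, 1), ('x', 1, 1), ('y', 1, 2), ('y', 2, 1)]
```

Grounding the formula then amounts to substituting actual constants for the templates `x` and `y` while keeping the same edge structure.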

Therefore, every single formula can be represented by a factor graph. We will construct a factor graph to define a particular formula, and show that the MLN induced by this formula results in different posteriors for $r(a)$ and $r(a')$. The factor graph for the formula is constructed in the following way (see Figure 8 for an example of the formula produced by these steps):

1. Given the above assumption (A 1), we claim that:

there exists a subgraph $G^*\subseteq G$ such that all subgraphs $G'\subseteq G$ satisfy:

(Condition) if there exists an isomorphism $\psi$ between $G^*$ and $G'$ satisfying $\psi(a)=a'$ after the observation values are IGNORED (that is, fact nodes observed to be 1 and fact nodes observed to be 0 are treated as the SAME type of nodes), then the sets of fact nodes (observations) in these two subgraphs are different.

The proof of this claim is given at the end of this proof.

2. Next, we use $G^*$ to define a formula $f$. We first initialize the definition of the formula value as

$$f(c_1,\dots,c_n,\tilde{c}_1,\dots,\tilde{c}_n)=\Big(\bigwedge\big\{\tilde{r}(a_{\tilde{r}}): \tilde{r}(a_{\tilde{r}})\in G^*_{c_{1:n}}\big\}\Big)\Rightarrow r(c_1,\dots,c_n). \qquad (11)$$

Then, we change $\tilde{r}(a_{\tilde{r}})$ in this formula to its negation if the observed value of $\tilde{r}(a_{\tilde{r}})$ is 0 in $G^*$.

We have defined a formula