Evaluating Logical Generalization in Graph Neural Networks

03/14/2020 ∙ by Koustuv Sinha, et al. ∙ 9

Recent research has highlighted the role of relational inductive biases in building learning agents that can generalize and reason in a compositional manner. However, while relational learning algorithms such as graph neural networks (GNNs) show promise, we do not understand how effectively these approaches can adapt to new tasks. In this work, we study the task of logical generalization using GNNs by designing a benchmark suite grounded in first-order logic. Our benchmark suite, GraphLog, requires that learning algorithms perform rule induction in different synthetic logics, represented as knowledge graphs. GraphLog consists of relation prediction tasks on 57 distinct logical domains. We use GraphLog to evaluate GNNs in three different setups: single-task supervised learning, multi-task pretraining, and continual learning. Unlike previous benchmarks, our approach allows us to precisely control the logical relationship between the different tasks. We find that the ability for models to generalize and adapt is strongly determined by the diversity of the logical rules they encounter during training, and our results highlight new challenges for the design of GNN models. We publicly release the dataset and code used to generate and interact with the dataset at https://www.cs.mcgill.ca/ ksinha4/graphlog.




1 Introduction

Relational reasoning, or the ability to reason about the relationships between objects or entities in the environment, is considered a fundamental aspect of intelligence (Krawczyk et al., 2011; Halford et al., 2010). Relational reasoning is known to play a critical role in the cognitive growth of children (Son et al., 2011; Farrington-Flint et al., 2007; Richland et al., 2010). This ability to infer relations between objects/entities/situations, and to compose relations into higher-order relations, is one of the reasons why humans quickly learn how to solve new tasks (Holyoak and Morrison, 2012; Alexander, 2016).

Figure 1: GraphLog setup. We define a large set of rules that are grounded in propositional logic. We partition the rule set into overlapping subsets, which we use to define the unique worlds. Finally, within each world, we generate several knowledge graphs that are governed by the rule set of that world.

The perceived importance of relational reasoning for generalization capabilities has fueled the development of several neural network architectures that incorporate relational inductive biases (Battaglia et al., 2016; Santoro et al., 2017; Battaglia et al., 2018). Graph neural networks (GNNs), in particular, have emerged as a dominant computational paradigm within this growing area (Scarselli et al., 2008; Hamilton et al., 2017a; Gilmer et al., 2017; Schlichtkrull et al., 2018; Du et al., 2019). However, despite the growing interest in GNNs and their promise for improving the generalization capabilities of neural networks, we currently lack an understanding of how effectively these models can adapt and generalize across distinct tasks.

In this work, we study the task of logical generalization in the context of relational reasoning using GNNs. In particular, we study how GNNs can induce logical rules and generalize by combining these rules in novel ways after training. We propose a benchmark suite, GraphLog, that is grounded in first-order logic. Figure 1 shows the setup of the benchmark. Given a set of logical rules, we create different logical worlds with overlapping rules. For each world, we sample multiple knowledge graphs. The learning agent should learn to induce the logical rules for predicting the missing facts in these knowledge graphs. Using our benchmark, we evaluate the generalization capabilities of GNNs in a supervised setting by predicting unseen combinations of known rules within a specific logical world. This task explicitly requires inductive generalization. We further analyze how various GNN architectures perform in the multi-task and the continual learning scenarios, where they have to learn over a set of logical worlds with different underlying logic. Our setup allows us to control the similarity between the different worlds by controlling the overlap in logical rules between them. This enables us to precisely analyze how task similarity impacts performance in the multi-task setting.

Our analysis provides the following useful insights regarding the logical generalization capabilities of GNNs:

  • Two architecture choices for GNNs have a strong positive impact on the generalization performance: 1) incorporating multi-relational edge features using attention, and 2) explicitly modularising the GNN architecture to include a parametric representation function, which learns representations for the relations based on the knowledge graph structure.

  • In the multi-task setting, training a model on a more diverse set of logical worlds improves generalization and adaptation performance.

  • All the evaluated models exhibit catastrophic forgetting in the continual learning setting. This indicates that the models are prone to fitting to just the current task at hand and not learning representations and compositions that can transfer across tasks—highlighting the challenge of lifelong learning in the context of logical generalization and GNNs.

2 Background and Related Work

Graph Neural Networks. Several graph neural network (GNN) architectures have been proposed to learn the representation for the graph input (Scarselli et al., 2008; Duvenaud et al., 2015; Defferrard et al., 2016; Kipf and Welling, 2016; Gilmer et al., 2017; Veličković et al., 2017; Hamilton et al., 2017b; Schlichtkrull et al., 2018). Previous works have focused on evaluating graph neural networks in terms of their expressive power (Morris et al., 2019; Xu et al., 2018), usefulness of features (Chen et al., 2019), and explaining the predictions from GNNs (Ying et al., 2019). Complementing these works, we evaluate GNN models on the task of logical generalization.

Knowledge graph completion. Many knowledge graph datasets are available for the task of relation prediction (also known as knowledge base completion). Prominent examples include Freebase15K (Bordes et al., 2013), WordNet (Miller, 1995), NELL (Mitchell and Fredkin, 2014), and YAGO (Suchanek et al., 2007; Hoffart et al., 2011; Mahdisoltani et al., 2013). These datasets are derived from real-world knowledge graphs and are useful for empirical evaluation of relation prediction systems. However, these datasets are generally noisy and incomplete, as many facts are not available in the underlying knowledge bases (West et al., 2014; Paulheim, 2017). Moreover, the logical rules underpinning these systems are often opaque and implicit (Guo et al., 2016). All these shortcomings reduce the usefulness of existing knowledge graph datasets for understanding the logical generalization capability of neural networks. Some of these limitations can be overcome by using synthetic datasets, which can provide a high degree of control and flexibility over the data generation process at a low cost. Synthetic datasets are useful for understanding the behavior of different models - especially when the underlying problem can have many factors of variations. We consider using synthetic datasets, as a means and not an end, to understand the logical generalization capability of GNNs.

Our GraphLog benchmark serves as a synthetic complement to the real-world datasets. Instead of sampling from a real-world knowledge base, we create synthetic knowledge graphs that are governed by a known and inspectable set of logical rules. Moreover, the relations in GraphLog do not require any common-sense knowledge, which makes the tasks self-contained.

Procedurally generated datasets for reasoning. In recent years, several procedurally generated benchmarks have been proposed to study the relational reasoning and compositional generalization properties of neural networks. Some recent and prominent examples are listed in Table 1. These datasets aim to provide a controlled testbed for evaluating the compositional reasoning capabilities of neural networks in isolation. Based on these existing works and their insightful observations, we enumerate the four key desiderata that, we believe, such a benchmark should provide:

  1. Interpretable Rules: The rules that are used to procedurally generate the dataset should be human interpretable.

  2. Diversity: The benchmark datasets should have enough diversity across different tasks, and the compositional rules used to solve different tasks should be distinct, so that adaptation on a novel task is not trivial. The degree of similarity across the tasks should be configurable to enable evaluating the role of diversity in generalization.

  3. Compositional generalization: The benchmark should require compositional generalization, i.e., generalization to unseen combinations of rules.

  4. Number of tasks: The benchmark should support creating a large number of tasks. This enables a more fine-grained inspection of the generalization capabilities of the model in different setups, e.g., supervised learning, multitask learning, and continual learning.

As shown in Table 1, GraphLog is unique in satisfying all of these desiderata. We highlight that GraphLog is the only dataset specifically designed to test logical generalization capabilities on graph data, whereas previous works have largely focused on the image and text modalities.

Dataset IR D CG M S Me Mu CL
CLEVR (Johnson et al., 2017) Vision
CoGenT (Johnson et al., 2017) Vision
CLUTRR (Sinha et al., 2019) Text
SCAN (Lake and Baroni, 2017) Text
SQoOP (Bahdanau et al., 2018) Vision
TextWorld (Côté et al., 2018) Text
GraphLog (Proposed) Graph
Table 1: Features of related datasets that are: 1) designed to test compositional generalization and reasoning, and 2) procedurally generated. We compare the datasets along the following dimensions: Inspectable Rules (IR), Diversity (D), Compositional Generalization (CG), Modality (M), and whether the following training setups are supported: Supervised (S), Meta-learning (Me), Multitask (Mu) and Continual learning (CL).

3 GraphLog

3.1 Terminology

A graph $G = (V_G, E_G)$ is a collection of a set of nodes $V_G$ and a set of edges $E_G$ between the nodes. In this work, we assume that each pair of nodes has at most one edge between them. A relational graph is a graph where the edge between two nodes (say $u$ and $v$) is assigned a label, denoted $r$. The labeled edge is denoted as $(u \xrightarrow{r} v) \in E_G$. A relation set $R$ is a set of relations $\{r_1, r_2, \dots, r_K\}$. A rule set $\mathcal{R}$ is a set of rules in first order logic, which we denote in the Datalog format (Evans and Grefenstette, 2017), $[r_i, r_j] \Rightarrow r_k$, and which can be expanded as Horn clauses of the form:

$$\exists z \in V_G : (u \xrightarrow{r_i} z) \wedge (z \xrightarrow{r_j} v) \Rightarrow (u \xrightarrow{r_k} v) \qquad (1)$$

where $z$ denotes a variable that can be bound to any entity and $\Rightarrow$ denotes logical implication. The relations $r_i, r_j$ form the body while the relation $r_k$ forms the head of the rule. Horn clauses of this form represent a well-defined subset of first-order logic, and they encompass the types of logical rules learned by the vast majority of existing rule induction engines for knowledge graphs (Langley and Simon, 1995).

We use $p_G^{u,v}$ to denote a path from node $u$ to node $v$ in a graph $G$. We construct graphs according to rules of the form in Equation 1 so that a path between two nodes will always imply a specific relation between these two nodes. In other words, writing $R$ for the relation set, we will always have that

$$\exists r_i \in R : p_G^{u,v} \Rightarrow (u \xrightarrow{r_i} v) \qquad (2)$$

Thus, by following the path between two nodes and applying the propositional rules along the edges of the path, we can resolve the relationship between the nodes. Hence, we refer to the paths as resolution paths. The edges of the resolution path are concatenated together to obtain a descriptor. These descriptors are used for quantifying the similarity between different resolution paths, with a higher overlap between the descriptors implying a greater similarity between two resolution paths.
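To make the idea of a resolution path concrete, the following is a minimal Python sketch. It assumes rules with two-relation bodies stored as a dictionary, and it assumes that the path is reduced from left to right; the rule names, the `_` separator for descriptors, and the reduction order are illustrative assumptions rather than details taken from the GraphLog code.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy rule set: each body (a pair of relations) maps to one head.
RULES: Dict[Tuple[str, str], str] = {
    ("r1", "r2"): "r3",
    ("r3", "r4"): "r5",
}


def descriptor(path: List[str]) -> str:
    """Concatenate the edge labels of a resolution path into a descriptor."""
    return "_".join(path)


def resolve(path: List[str], rules: Dict[Tuple[str, str], str]) -> Optional[str]:
    """Reduce a path to a single relation by applying rules left to right.

    Returns None when the path is empty or when no rule matches at some step.
    The left-to-right order is an assumption of this sketch.
    """
    if not path:
        return None
    current: Optional[str] = path[0]
    for nxt in path[1:]:
        current = rules.get((current, nxt))
        if current is None:
            return None
    return current
```

Under these toy rules, the path `["r1", "r2", "r4"]` first collapses `r1, r2` into `r3` and then `r3, r4` into `r5`, which would be the target relation between the source and sink nodes.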

3.2 Problem Setup

We formulate the relational reasoning task as predicting relations between the nodes in a relational graph. Given a query $(g, u, v)$, where $u$ and $v$ are nodes of the relational graph $g$, the learner has to predict the relation $r_?$ for the edge $u \xrightarrow{r_?} v$. Unlike the previous work on knowledge graph completion, we emphasize an inductive problem setup, where the graph in each query is unique. Rather than reasoning on a single static knowledge graph during training and testing, we consider the setting where the model must learn to generalize to unseen graphs during evaluation.

3.3 Dataset Generation

As discussed in Section 2, we want our proposed benchmark to provide four key desiderata: (i) interpretable rules, (ii) diversity, (iii) compositional generalization and (iv) large number of tasks. We describe how our dataset generation process ensures all four aspects.

Rule generation. We create a set of relations and use it to sample a rule set. We impose two constraints on the rule set: (i) No two rules can have the same body. This ensures consistency between the rules. (ii) Rules cannot have common relations among the head and body. This ensures the absence of cyclic dependencies in rules (Hamilton et al., 2018). Generating the dataset using a consistent and well-defined rule set ensures interpretability in the resulting dataset. The full algorithm for rule generation is given in the Appendix (Algorithm 1).
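The two constraints can be illustrated with a short Python sketch. This is not Algorithm 1 from the Appendix; it is a simplified sampler that assumes two-relation rule bodies and reads constraint (ii) literally as "the head of a rule does not appear in its own body". The function name and parameters are hypothetical.

```python
import itertools
import random
from typing import Dict, List, Tuple


def sample_rule_set(
    relations: List[str], num_rules: int, seed: int = 0
) -> Dict[Tuple[str, str], str]:
    """Sample rules of the form (body_1, body_2) -> head.

    Constraint (i): bodies are dictionary keys, so no two rules share a body.
    Constraint (ii): the head is drawn only from relations not in the body.
    """
    rng = random.Random(seed)
    bodies = list(itertools.product(relations, repeat=2))
    rng.shuffle(bodies)
    rules: Dict[Tuple[str, str], str] = {}
    for body in bodies:
        if len(rules) == num_rules:
            break
        heads = [r for r in relations if r not in body]
        if heads:  # skip bodies that leave no admissible head
            rules[body] = rng.choice(heads)
    return rules
```

Keying the dictionary by body makes constraint (i) hold by construction, so only constraint (ii) needs an explicit check.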

Graph generation. The graph generation process has two steps. In the first step, we recursively sample and use rules from the rule set to generate a relational graph called the WorldGraph (as shown in Figure 1). This sampling procedure enables us to create a diverse set of WorldGraphs by considering only certain subsets of the rule set during sampling. By controlling the extent of overlap between these subsets (in terms of the number of rules that are common across the subsets), we can precisely control the similarity between the different WorldGraphs. The full algorithm for generating the WorldGraph and controlling the similarity between the worlds is given in the Appendix (Algorithm 3 and Section A.2).
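One simple way to obtain rule subsets with a controllable overlap is a sliding window over an ordered rule list, sketched below in Python. The window scheme and the Jaccard-style similarity are assumptions made for illustration; the procedure actually used is the one described in the Appendix.

```python
from typing import List, Sequence, Set


def overlapping_subsets(rules: Sequence[str], size: int, stride: int) -> List[Set[str]]:
    """Slide a window of `size` rules with step `stride` over the rule list.

    A smaller stride yields more rules shared between consecutive subsets.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [set(rules[i:i + size]) for i in range(0, len(rules) - size + 1, stride)]


def similarity(a: Set[str], b: Set[str]) -> float:
    """Fraction of rules shared by two subsets (Jaccard index)."""
    return len(a & b) / len(a | b)
```

With ten rules, a window of four and a stride of two, consecutive subsets share two rules, while subsets two windows apart share none.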

In the second step, the WorldGraph is used to sample a set of graphs (shown as Step (a) in Figure 2). A graph $g$ is sampled from the WorldGraph by sampling a pair of nodes $(u, v)$ and then by sampling a resolution path from $u$ to $v$. The edge between the source and sink node of the path provides the target relation for the learning model to predict. To increase the complexity of the sampled graphs (beyond being simple paths), we also add nodes to $g$ by sampling neighbors of the nodes on the resolution path, such that no other shortest path exists between $u$ and $v$. Algorithm 4 (in the Appendix) details our graph sampling approach.
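The core of this sampling step can be sketched in Python as follows. The sketch assumes the WorldGraph is a dictionary from directed node pairs to relation labels, enumerates paths exhaustively (which is only viable for toy graphs), and leaves out the neighbor-sampling step; it is not Algorithm 4, and all names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]


def simple_paths(world: Dict[Edge, str], src: str, dst: str, max_len: int) -> List[List[Edge]]:
    """Enumerate simple directed paths src -> dst with 2..max_len edges,
    never using the direct edge (src, dst), which is reserved as the target."""
    found: List[List[Edge]] = []

    def walk(node: str, visited: List[str], edges: List[Edge]) -> None:
        if node == dst:
            if len(edges) >= 2:
                found.append(list(edges))
            return
        if len(edges) == max_len:
            return
        for (a, b) in world:
            if a == node and b not in visited and (a, b) != (src, dst):
                walk(b, visited + [b], edges + [(a, b)])

    walk(src, [src], [])
    return found


def sample_query(
    world: Dict[Edge, str], max_len: int = 4, seed: int = 0
) -> Optional[Tuple[Dict[Edge, str], Edge, str]]:
    """Return (graph, (u, v), target relation), or None if no path exists."""
    rng = random.Random(seed)
    candidates = []
    for (u, v), target in sorted(world.items()):
        for path in simple_paths(world, u, v, max_len):
            candidates.append((u, v, path, target))
    if not candidates:
        return None
    u, v, path, target = rng.choice(candidates)
    graph = {edge: world[edge] for edge in path}  # the query edge itself is withheld
    return graph, (u, v), target
```

The returned graph contains only the resolution path, so the learner has to infer the withheld relation on the edge between the source and sink nodes.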

3.4 Summary of the GraphLog Dataset

We use the data generation process described in Section 3.3 to instantiate a dataset suite with 57 distinct logical worlds, each with its own set of graphs (Figure 1). The dataset is divided into the sets of training, validation, and testing worlds. The graphs within each world are also split into training, validation, and testing sets. The key statistics of the datasets are given in Table 2. Though we instantiate 57 worlds, the GraphLog code can instantiate an arbitrary number of worlds and has been included in the supplementary material.

3.4.1 Setups supported in GraphLog

GraphLog enables us to investigate the logical relational reasoning performance of models in the following setups:

Statistic                              Value
Number of relations                    20
Total number of WorldGraphs            57
Total number of unique rules           76
Training Graphs per WorldGraph         5000
Validation Graphs per WorldGraph       1000
Testing Graphs per WorldGraph          1000
Number of rules per WorldGraph         20
Average number of descriptors          522
Maximum length of resolution path      10
Minimum length of resolution path      2
Table 2: Aggregate statistics of the worlds used in GraphLog. Statistics for each individual world are in the Appendix.

Supervised learning. In the supervised learning setup, a model is trained on the train split of a single logical world and evaluated on the test split of the same world. The total number of rules grows exponentially with the number of relations, making it impossible to train on all possible combinations of the relations. However, we expect a perfectly systematic model to generalize to unseen combinations of relations after training on only a subset of the combinations (i.e., via inductive reasoning).

Multi-task learning. GraphLog provides multiple logical worlds, each with its own training and evaluation splits. In the standard multi-task training, the model is trained on the train split of many worlds and evaluated on the test split of the same worlds. The complexity of each world and the similarity between the different worlds can be precisely controlled. GraphLog thus enables us to evaluate how model performance varies when the model is trained on similar vs. dissimilar worlds.

GraphLog is also designed to study the effect of pre-training on adaptation. In this setup, the model is first pre-trained on the train split of multiple worlds and then adapted (fine-tuned) on the train split of unseen held-out worlds. The model is evaluated on the test split of the novel worlds. Similar to the previous setup, GraphLog provides us an opportunity to investigate the effect of similarity in pre-training. This enables GraphLog to mimic in-distribution and out-of-distribution training and testing scenarios, as well as to precisely characterize the effect of multi-task pre-training on adaptation performance.

Continual learning. GraphLog provides access to a large number of worlds, enabling us to evaluate the logical generalization capability of the models in the continual learning setup. In this setup, the model is trained on a sequence of worlds. Before training on a new world, the model is evaluated on all the worlds that it has been trained on so far. Given the several challenges involved in continual learning (Thrun and Pratt, 2012; Parisi et al., 2019; De Lange et al., 2019; Sodhani et al., 2019), we do not expect the models to be able to remember the knowledge from all the previous tasks. Nonetheless, given that we are evaluating the models for relational reasoning and that our datasets share relations, we would expect the models to retain some knowledge of how to solve the previous tasks. In this sense, the performance on the previous tasks can also be seen as an indicator of whether the models actually learn to solve the relational reasoning tasks or just fit to the current dataset distribution.
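The evaluation protocol can be sketched as a small Python loop. The callables `train_fn` and `eval_fn` are hypothetical placeholders for whatever training and test-accuracy routines are used, and the toy "model" at the bottom, which remembers only the last world it saw, exists purely to exercise the loop.

```python
from typing import Callable, Dict, List, Optional, Sequence


def continual_eval(
    worlds: Sequence[str],
    train_fn: Callable[[str], None],
    eval_fn: Callable[[str], float],
) -> List[Dict[str, Optional[float]]]:
    """Train on the worlds in order. After each world, record accuracy on the
    current world and the mean accuracy over all previously seen worlds."""
    history: List[Dict[str, Optional[float]]] = []
    for i, world in enumerate(worlds):
        train_fn(world)
        previous = worlds[:i]
        mean_prev = (
            sum(eval_fn(w) for w in previous) / len(previous) if previous else None
        )
        history.append({"current": eval_fn(world), "previous_mean": mean_prev})
    return history


# Toy stand-in for a model that fits only the most recent world.
_state = {"last": None}


def _toy_train(world: str) -> None:
    _state["last"] = world


def _toy_eval(world: str) -> float:
    return 1.0 if _state["last"] == world else 0.0


history = continual_eval(["w1", "w2", "w3"], _toy_train, _toy_eval)
```

A model that retains nothing, like the toy one here, scores perfectly on the current world and zero on every earlier one, which is the extreme case of the forgetting behaviour this setup is meant to expose.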

Figure 2: Overview of the training process: (a): Sampling multiple graphs from the WorldGraph. (b): Converting the relational graph into the extended graph. Note that edges of different color (denoting different types of relations) are replaced by a node of the same type in the extended graph. (c): Learning representations of the relations using the representation function with the extended graph as the input. In the case of Param models, the relation representations are parameterized via an embedding layer and the extended graph is not created. (d, e): The composition function takes as input the query and the relation representations. (f): The composition function predicts the relation between the two query nodes.

4 Representation and Composition

In this section, we describe the graph neural network (GNN) architectures that we evaluate on the GraphLog benchmark. In order to perform well on the benchmark tasks, a model should learn representations that are useful for solving the tasks in the current world while being general enough to be effectively adapted to the new worlds. To this end, we structure the GNN models we analyze around two key modules:

  • Representation module: This module is represented as a function which maps logical relations within a particular world to $d$-dimensional vector representations. Intuitively, this function should learn how to encode the semantics of the various relations within a logical world.

  • Composition module: This module is a function which learns how to compose the relation representations learned by the representation module to make predictions about queries over a knowledge graph.

Note that though we break down the process into two steps, in practice, the learner does not have access to the correct representations of relations or to the underlying rule set. The learner has to rely only on the target labels to solve the reasoning task. We hypothesize that this separation of concerns between a representation function and a composition function (Dijkstra, 1982) could provide a useful inductive bias for the model.

4.1 Representation modules

We first describe the different approaches for learning the representation for the relations. These representations will be provided as input to the composition function.

Direct parameterization. The simplest approach to define the representation module is to train a unique embedding for each relation. This approach is predominantly used in the previous work on GNNs (Gilmer et al., 2017; Veličković et al., 2017), and we term it the Param representation module. A major limitation of this approach is that the relation representations are optimized specifically for each logical world, and there is no inductive bias towards learning representations that can generalize.

Learning representations from the graph structure. In order to define a more powerful and expressive representation function, we consider an approach that learns relation representations as a function of the WorldGraph underlying a logical world. To do so, we consider an “extended” form of the WorldGraph, where we introduce new nodes (called edge-nodes) corresponding to each edge in the original WorldGraph. For an edge with label $r$ between nodes $u$ and $v$, the corresponding edge-node is connected to only those nodes that were incident to the edge in the original graph (i.e. nodes $u$ and $v$; see Figure 2, Step (b)). This new graph has only one type of edge and comprises nodes from both the original graph and the set of edge-nodes.
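The construction of the extended graph can be sketched in a few lines of Python. The sketch assumes the relational graph is given as a dictionary from directed node pairs to relation labels; the edge-node naming scheme and the choice to keep the two new links directed are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def extend_graph(
    world: Dict[Edge, str],
) -> Tuple[Set[str], Dict[str, str], List[Tuple[str, str]]]:
    """Replace each labeled edge u -r-> v by an edge-node linked to u and v.

    Returns the original nodes, a map from edge-node id to relation label,
    and the untyped links of the extended graph.
    """
    nodes: Set[str] = set()
    edge_nodes: Dict[str, str] = {}
    links: List[Tuple[str, str]] = []
    for (u, v), rel in sorted(world.items()):
        nodes.update((u, v))
        edge_node = f"{u}-{rel}-{v}"  # hypothetical naming scheme
        edge_nodes[edge_node] = rel
        links.append((u, edge_node))
        links.append((edge_node, v))
    return nodes, edge_nodes, links
```

Every relational edge contributes one edge-node and two untyped links, so a graph with E labeled edges yields E edge-nodes and 2E links in its extended form.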

We learn the relation representations by training a GNN model on the expanded WorldGraph and by averaging the edge-node embeddings corresponding to each relation type (Step (c) in Figure 2). For the GNN model, we consider the Graph Convolutional Network (GCN) (Kipf and Welling, 2016) and the Graph Attention Network (GAT) architectures. Since the nodes do not have any features or attributes, we randomly initialize the embeddings in these GNN message passing layers.
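The averaging step can be illustrated in plain Python. The sketch assumes that a GNN has already produced one embedding per edge-node and that each edge-node is tagged with its relation type; both dictionaries and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def relation_embeddings(
    edge_node_emb: Dict[str, List[float]],
    edge_node_rel: Dict[str, str],
) -> Dict[str, List[float]]:
    """Average the embeddings of all edge-nodes that share a relation type."""
    groups: Dict[str, List[List[float]]] = defaultdict(list)
    for edge_node, vec in edge_node_emb.items():
        groups[edge_node_rel[edge_node]].append(vec)
    return {
        rel: [sum(column) / len(vecs) for column in zip(*vecs)]
        for rel, vecs in groups.items()
    }
```

Because the result is a mean over edge-nodes, a relation's representation depends on every place it occurs in the graph rather than on a single free parameter.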

The intuition behind creating the extended graph is that the representation GNN function can learn the relation embeddings based on the structure of the complete relational graph. We expect this to provide an inductive bias that can generalize more effectively than the simple Param approach. Finally, note that while the representation function is given access to the WorldGraph to learn representations for relations, the composition module is not able to interface with the WorldGraph in order to make predictions about a query.

4.2 Composition modules

We now describe the GNNs used for the composition modules. These models take as input the query and the relation embedding (Step (d) and (e) in Figure 2).

Relational Graph Convolutional Network (RGCN). Given that the input to the composition module is a relational graph, the RGCN model (Schlichtkrull et al., 2018) is a natural choice for a baseline architecture. In this approach, we iterate a series of message passing operations:

$$h_u^{(t)} = \mathrm{ReLU}\left( \sum_{r_i \in R} \sum_{v \in \mathcal{N}_{r_i}(u)} \mathbf{r}_i \times_1 \mathcal{T} \times_3 h_v^{(t-1)} \right)$$

where $h_u^{(t)} \in \mathbb{R}^{d}$ denotes the representation for a node $u$ at the $t$-th layer of the model, $\mathcal{T} \in \mathbb{R}^{d_r \times d \times d}$ is a learnable tensor, $\mathbf{r}_i \in \mathbb{R}^{d_r}$ is the representation for relation $r_i$, and $\mathcal{N}_{r_i}(u)$ denotes the neighbors of node $u$ by relation $r_i$. We use $\times_i$ to denote multiplication across a particular mode of the tensor. This RGCN model learns a relation-specific propagation matrix, specified by the interaction between the relation embedding and the shared tensor. (Footnote 1: the shared tensor is equivalent to the basis matrix formulation in Schlichtkrull et al. (2018).)
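The idea of a relation-specific propagation matrix obtained from a relation embedding and a shared tensor can be sketched in dependency-free Python. The sketch assumes a shared tensor of shape d_r x d x d, a ReLU nonlinearity, and no self-loops or normalization; these are simplifying assumptions, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Tensor3 = List[Matrix]  # assumed shape: d_r x d x d


def propagation_matrix(rel_emb: Vector, shared: Tensor3) -> Matrix:
    """Contract the relation embedding with the first mode of the shared tensor."""
    d = len(shared[0])
    return [
        [sum(rel_emb[a] * shared[a][i][j] for a in range(len(rel_emb))) for j in range(d)]
        for i in range(d)
    ]


def rgcn_step(
    h: Dict[str, Vector],
    edges: List[Tuple[str, str, str]],  # (source, relation, target)
    rel_embs: Dict[str, Vector],
    shared: Tensor3,
) -> Dict[str, Vector]:
    """One message-passing step: each target sums its transformed neighbors."""
    d = len(next(iter(h.values())))
    agg = {node: [0.0] * d for node in h}
    mats = {rel: propagation_matrix(emb, shared) for rel, emb in rel_embs.items()}
    for src, rel, dst in edges:
        W = mats[rel]
        for i in range(d):
            agg[dst][i] += sum(W[i][j] * h[src][j] for j in range(d))
    return {node: [max(0.0, x) for x in vec] for node, vec in agg.items()}
```

Since all relations share one tensor, only the low-dimensional relation embedding differs between relations, which keeps the number of relation-specific parameters small.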

Edge-based Graph Attention Network (Edge-GAT). In addition to the RGCN model, which is considered the de facto standard architecture for applying GNNs to multi-relational data, we also explore an extension of the Graph Attention Network (GAT) model (Veličković et al., 2017) to handle edge types. Many recent works have highlighted the importance of the attention mechanism, especially in the context of relational reasoning (Vaswani et al., 2017; Santoro et al., 2018; Schlag et al., 2019). Motivated by this observation, we investigate an extended version of the GAT, where we incorporate gating via an LSTM (Hochreiter and Schmidhuber, 1997) and where the attention is conditioned on both the incoming message (from the other nodes) and the relation embedding (of the other nodes).

Following the original GAT model, the attention function is defined using a dense neural network on the concatenation of the input vectors. We refer to this model as the Edge GAT (E-GAT) model.

Query and node representations. We predict the relation for a given query by concatenating the final-layer embeddings of the two query nodes and applying a two-layer dense neural network (Step (f) in Figure 2). The entire model (i.e., the representation function and the composition function) is trained end-to-end using the softmax cross-entropy loss. Since we have no node features, we randomly initialize all the node embeddings in the GNNs.

5 Experiments

Figure 3: We categorize the datasets in terms of their relative difficulty (see Appendix). We observe that the models using E-GAT as the composition function consistently work well.

We aim to quantify the performance of the different GNN models on the task of logical relation reasoning, in three contexts: (i) Single Task Supervised Learning, (ii) Multi-Task Training and (iii) Continual Learning. Our experiments use the GraphLog benchmark with 57 distinct worlds or knowledge graph datasets (see Section 3) and 6 different GNN models (see Section 4). In the main paper, we share the key trends and observations that hold across the different combinations of the models and the datasets, along with some representative results. The full set of results is provided in the Appendix. All the models are implemented using PyTorch 1.3.1 (Paszke et al., 2019). The code has been included with the supplemental material.

5.1 Single Task Supervised Learning

In our first setup, we train and evaluate all of the models on all the 57 worlds, one model-world pair at a time. This experiment provides several important results. Previous works considered only a handful of datasets when evaluating the different models on the task of relational reasoning. As such, it is possible to design a model that can exploit the biases present in the few datasets that the model is being evaluated over. In our case, we consider over 50 datasets with different characteristics (Table 2). It is difficult for one model to outperform the other models on all the datasets just by exploiting some dataset-specific bias, which makes the conclusions more robust.

In Figure 3, we present the results for the different models. We group the worlds into three categories of difficulty (easy, moderate and difficult) based on the relative test performance of the models on each world. Table 6 (in the Appendix) contains the results for the different models on the individual worlds. We observe that the models using E-GAT as the composition function always outperform their counterparts using the RGCN models. This confirms our hypothesis about the usefulness of combining relational reasoning and attention for improving the performance on relational reasoning tasks. An interesting observation is that the relative ordering among the worlds, in terms of the test accuracy of the different models, is consistent irrespective of the model we use, highlighting the intrinsic difficulty of the different worlds in GraphLog.

5.2 Multi-Task Training

Representation   Composition   Accuracy (S)   Accuracy (D)
GAT              E-GAT         0.534          0.534
GAT              RGCN          0.474          0.502
GCN              E-GAT         0.522          0.533
GCN              RGCN          0.448          0.476
Param            E-GAT         0.507          0.5
Param            RGCN          0.416          0.449
Table 3: Multitask evaluation performance when trained on different data distributions. We categorize the training distributions on the basis of the similarity of their rules: Similar (S), containing only similar worlds, and Mixed (D), containing a mix of similar and dissimilar worlds.

We now turn to the setting of multi-task learning where we train the same model on multiple logical worlds.

Figure 4: We run multitask experiments over an increasing number of worlds to stretch the capacity of the models. Evaluation performance is reported as the average of test set performance across the worlds that the model has trained on so far. All the models reach their optimal performance at 20 worlds, beyond which their performance starts to degrade.

Basic multi-task training. First, we evaluate how changing the similarity among the training worlds affects the test performance in the multi-task setup, where a model is trained jointly on eight worlds and tested on three distinct worlds. In Table 3, we observe that considering a mix of similar and dissimilar worlds improves the generalization capabilities of most of the models when evaluated on the test split (the exceptions are GAT-E-GAT, which is unchanged, and Param-E-GAT, which drops slightly). Another important observation is that, just like in the supervised learning setup, the GAT-E-GAT model consistently performs as well as or better than the other models, and the models using E-GAT for the composition function perform better than the ones using the RGCN model. Figure 4 shows how the performance of the various models changes when we perform multi-task training on an increasingly large set of worlds. Interestingly, we see that model performance improves as the number of worlds is increased from 10 to 20 but then begins to decline, indicating capacity saturation in the presence of too many diverse worlds.

Multi-task pre-training. In this setup, we pre-train the model on multiple worlds and adapt it on a held-out world. We study how the models’ adaptation capabilities vary as the similarity between the training and the evaluation distributions changes. Figure 5 considers the case of zero-shot adaptation and adaptation until convergence. As we move along the x-axis, the zero-shot performance (shown with solid colors) decreases in all the setups. This is expected, as the similarity between the training and the evaluation distributions also decreases. An interesting trend is that the model’s performance after adaptation increases as the similarity between the two distributions decreases. This suggests that training over a diverse set of distributions improves adaptation capability. The results for adaptation with 5, 10, … 30 steps are provided in the Appendix (Figure 8).

Figure 5: We evaluate the effect of changing the similarity between the training and the evaluation datasets. The colors of the bars depict how similar the two distributions are, while the y-axis shows the mean accuracy of the model on the test split of the evaluation world. We report both the zero-shot adaptation performance and the performance after convergence.
Figure 6: We evaluate the performance of all the models in a continual learning setup. The blue curve shows the accuracy on the current world and the orange curve shows the mean accuracy on all the previously seen worlds. As the model trains on new worlds, its performance on the previously seen worlds degrades rapidly. This is the forgetting effect commonly encountered in continual learning setups.
Figure 7: We evaluate the performance in a continual learning setup where we share either the representation function or the composition function or both. We observe that sharing the representation function reduces the effect of catastrophic forgetting as compared to the other setups.

5.3 Continual Learning Setup

In the continual learning setup, we evaluate the knowledge retention capabilities of the GNN models. We train the model on a sequence of overlapping worlds, and after converging on every world, we report the average of the model's performance on all the previous worlds. In Figure 6, we observe that as the model is trained on different worlds, its performance on the previous worlds degrades rapidly. This highlights that current reasoning models are not well suited to continual learning.
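The evaluation protocol described above can be sketched as follows. This is a minimal illustration rather than the released code: `train_fn` and `eval_fn` are hypothetical callables that stand in for training a model to convergence on a world and for measuring its accuracy on that world's test split.

```python
from typing import Callable, List, Sequence, Tuple


def continual_eval(
    worlds: Sequence[str],
    train_fn: Callable[[str], None],
    eval_fn: Callable[[str], float],
) -> List[Tuple[float, float]]:
    """Train on the worlds in order; after each one, record
    (accuracy on the current world, mean accuracy on all earlier worlds)."""
    history = []
    for i, world in enumerate(worlds):
        train_fn(world)  # hypothetical: train until convergence on this world
        current = eval_fn(world)
        past = [eval_fn(w) for w in worlds[:i]]
        # No earlier worlds exist for the first task, so report NaN there.
        mean_past = sum(past) / len(past) if past else float("nan")
        history.append((current, mean_past))
    return history
```

The two values recorded at each step correspond to the two curves plotted in Figure 6: accuracy on the current world and mean accuracy on the previously seen worlds.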

The role of the representation function. We also investigate the model’s performance in a continual learning setup where the model learns only a world-specific representation function or a world-specific composition function, and where the other module is shared across the worlds. In Figure 7, we observe that sharing the representation function reduces the effect of catastrophic forgetting, but sharing the composition function does not have the same effect. This suggests that the representation function learns representations that are useful across the worlds.

6 Discussion & Conclusion

In this work, we propose GraphLog, a benchmark suite for evaluating the logical generalization capabilities of Graph Neural Networks. GraphLog is grounded in first-order logic and provides access to a large number of diverse tasks that require compositional generalization to solve, including single task supervised learning, multi-task learning, and continual learning. Our results highlight the importance of attention mechanisms and modularity to achieve logical generalization, while also highlighting open challenges related to multi-task and continual learning in the context of GNNs. A natural direction for future work is leveraging GraphLog for studies of fast adaptation and meta-learning in the context of logical reasoning (e.g., via gradient-based meta learning), as well as integrating state-of-the-art methods (e.g., regularization techniques) to combat catastrophic forgetting in the context of GNNs.


References

  • P. A. Alexander (2016) Relational thinking and relational reasoning: harnessing the power of patterning. NPJ science of learning 1 (1), pp. 1–7. Cited by: §1.
  • D. Bahdanau, S. Murty, M. Noukhovitch, T. H. Nguyen, H. de Vries, and A. Courville (2018) Systematic generalization: what is required and can it be learned?. arXiv preprint arXiv:1811.12889. Cited by: Table 1.
  • P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pp. 4502–4510. Cited by: §1.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §1.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: Appendix D.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: §2.
  • T. Chen, S. Bian, and Y. Sun (2019) Are powerful graph neural nets necessary? a dissection on graph classification. arXiv preprint arXiv:1905.04579. Cited by: §2.
  • M. Côté, Á. Kádár, X. Yuan, B. Kybartas, T. Barnes, E. Fine, J. Moore, M. Hausknecht, L. El Asri, M. Adada, et al. (2018) Textworld: a learning environment for text-based games. In Workshop on Computer Games, pp. 41–75. Cited by: Table 1.
  • M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2019) Continual learning: a comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383. Cited by: §3.4.1.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §2.
  • E. W. Dijkstra (1982) On the role of scientific thought. In Selected writings on computing: a personal perspective, pp. 60–66. Cited by: §4.
  • S. S. Du, K. Hou, R. R. Salakhutdinov, B. Poczos, R. Wang, and K. Xu (2019) Graph neural tangent kernel: fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, pp. 5724–5734. Cited by: §1.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §2.
  • R. Evans and E. Grefenstette (2017) Learning Explanatory Rules from Noisy Data. Cited by: §3.1.
  • L. Farrington-Flint, K. H. Canobi, C. Wood, and D. Faulkner (2007) The role of relational reasoning in children’s addition concepts. British Journal of Developmental Psychology 25 (2), pp. 227–246. Cited by: §1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §1, §2, §4.1.
  • M. Guo, A. Haque, D. Huang, S. Yeung, and L. Fei-Fei (2018) Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–287. Cited by: §A.4.
  • S. Guo, Q. Wang, L. Wang, B. Wang, and L. Guo (2016) Jointly embedding knowledge graphs and logical rules. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 192–202. Cited by: §2.
  • G. S. Halford, W. H. Wilson, and S. Phillips (2010) Relational knowledge: the foundation of higher cognition. Trends in cognitive sciences 14 (11), pp. 497–505. Cited by: §1.
  • W. Hamilton, R. Ying, and J. Leskovec (2017a) Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin. Cited by: §1.
  • W. Hamilton, P. Bajaj, M. Zitnik, D. Jurafsky, and J. Leskovec (2018) Embedding logical queries on knowledge graphs. In Advances in Neural Information Processing Systems 31, pp. 2026–2037. Cited by: §3.3.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017b) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.2.
  • J. Hoffart, F. M. Suchanek, K. Berberich, E. Lewis-Kelham, G. De Melo, and G. Weikum (2011) YAGO2: exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the 20th international conference companion on World wide web, pp. 229–232. Cited by: §2.
  • K. J. Holyoak and R. G. Morrison (2012) The oxford handbook of thinking and reasoning. Oxford University Press. Cited by: §1.
  • J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: Table 1.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2, §4.1.
  • D. C. Krawczyk, M. M. McClelland, and C. M. Donovan (2011) A hierarchy for relational reasoning in the prefrontal cortex. Cortex 47 (5), pp. 588–597. Cited by: §1.
  • B. M. Lake and M. Baroni (2017) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350. Cited by: Table 1.
  • P. Langley and H. A. Simon (1995) Applications of machine learning and rule induction. Communications of the ACM 38 (11), pp. 54–64. Cited by: §3.1.
  • F. Mahdisoltani, J. Biega, and F. M. Suchanek (2013) Yago3: a knowledge base from multilingual wikipedias. Cited by: §2.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §2.
  • T. Mitchell and E. Fredkin (2014) Never ending language learning. In Big Data (Big Data), 2014 IEEE International Conference on, pp. 1–1. Cited by: §2.
  • C. Morris, M. Ritzert, M. Fey, W.L. Hamilton, J. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and Leman go neural: higher-order graph neural networks. In AAAI, Cited by: §2.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §3.4.1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dÁlché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §5.
  • H. Paulheim (2017) Knowledge graph refinement: a survey of approaches and evaluation methods. Semantic web 8 (3), pp. 489–508. Cited by: §2.
  • L. E. Richland, T. Chan, R. G. Morrison, and T. K. Au (2010) Young children’s analogical reasoning across cultures: similarities and differences. Journal of Experimental Child Psychology 105 (1-2), pp. 146–153. Cited by: §1.
  • A. Santoro, R. Faulkner, D. Raposo, J. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. Lillicrap (2018) Relational recurrent neural networks. In Advances in neural information processing systems, pp. 7299–7310. Cited by: §4.2.
  • A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: §1.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §1, §2.
  • I. Schlag, P. Smolensky, R. Fernandez, N. Jojic, J. Schmidhuber, and J. Gao (2019) Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611. Cited by: §4.2.
  • M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: §1, §2, §4.2, footnote 1.
  • K. Sinha, S. Sodhani, J. Dong, J. Pineau, and W. L. Hamilton (2019) Clutrr: a diagnostic benchmark for inductive reasoning from text. arXiv preprint arXiv:1908.06177. Cited by: Table 1.
  • S. Sodhani, S. Chandar, and Y. Bengio (2019) On training recurrent neural networks for lifelong learning. Cited by: §3.4.1.
  • J. Y. Son, L. B. Smith, and R. L. Goldstone (2011) Connecting instances to promote children’s relational reasoning. Journal of experimental child psychology 108 (2), pp. 260–277. Cited by: §1.
  • F. M. Suchanek, G. Kasneci, and G. Weikum (2007) Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pp. 697–706. Cited by: §2.
  • S. Thrun and L. Pratt (2012) Learning to learn. Springer Science & Business Media. Cited by: §3.4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2, §4.1, §4.2.
  • R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin (2014) Knowledge base completion via search-based question answering. In Proceedings of the 23rd international conference on World wide web, pp. 515–526. Cited by: §2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §2.
  • Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec (2019) Gnnexplainer: generating explanations for graph neural networks. In Advances in Neural Information Processing Systems, pp. 9240–9251. Cited by: §2.

Appendix A GraphLog

A.1 Extended Terminology

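In this section, we extend the terminology introduced in Section 3.1. A set of relations $\mathcal{R}$ is said to be Invertible if

$$\forall r_i \in \mathcal{R},\ \exists\, \hat{r}_i \in \mathcal{R} \ \text{such that}\ \forall u, v \in V_G:\ r_i(u, v) \in E_G \implies \hat{r}_i(v, u) \in E_G,$$

i.e., for all relations $r_i$ in $\mathcal{R}$, there exists a relation $\hat{r}_i$ in $\mathcal{R}$ such that for all node pairs $(u, v)$ in the graph, if there exists an edge $r_i(u, v)$ then there exists another edge $\hat{r}_i(v, u)$. Invertible relations are useful in determining the inverse of a clause, where the directionality of the clause is flipped along with the ordering of the elements in the conjunctive clause. For example, the inverse of Equation 1 will be of the form:

$$\exists z \in V_G:\ \hat{r}_j(v, z) \wedge \hat{r}_i(z, u) \implies \hat{r}_k(v, u).$$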
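The invertibility condition and the inversion of a clause described above can be expressed in a few lines of Python. This is an illustrative sketch only: the edge and rule encodings, the `inverse_of` mapping and the relation names are assumptions made for the example, not part of GraphLog.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]          # (source node, relation, target node)
Rule = Tuple[Tuple[str, str], str]   # ((body relation 1, body relation 2), head)


def is_invertible(edges: Set[Edge], inverse_of: Dict[str, str]) -> bool:
    """True if every edge u -r-> v has a matching edge v -inverse_of[r]-> u."""
    return all(
        r in inverse_of and (v, inverse_of[r], u) in edges
        for (u, r, v) in edges
    )


def invert_rule(rule: Rule, inverse_of: Dict[str, str]) -> Rule:
    """Flip the direction of a clause: reverse the body and invert each relation."""
    (r_i, r_j), r_k = rule
    return ((inverse_of[r_j], inverse_of[r_i]), inverse_of[r_k])


# Hypothetical relation names, used only to exercise the two functions.
inverse_of = {"r1": "r1_inv", "r1_inv": "r1",
              "r2": "r2_inv", "r2_inv": "r2",
              "r3": "r3_inv", "r3_inv": "r3"}
edges = {("a", "r1", "b"), ("b", "r1_inv", "a")}
print(is_invertible(edges, inverse_of))                 # True
print(invert_rule((("r1", "r2"), "r3"), inverse_of))    # (('r2_inv', 'r1_inv'), 'r3_inv')
```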


A.2 Dataset Generation

This section follows up on the discussion in Section 3.3. We describe all the steps involved in the dataset generation process.

Rule Generation. In Algorithm 1, we describe the complete process of generating rules in GraphLog. We require the set of relations, which we use to sample the rule set. We mark some rules as being Invertible Rules (Section A.1). Then, we iterate through all possible combinations of relations in DataLog format to sample possible candidate rules. We impose two constraints on each candidate rule: (i) No two rules in the rule set can have the same body. This ensures consistency between the rules. (ii) A candidate rule cannot have a relation that appears in both its head and its body. This ensures the absence of cycles. We also add the inverse rule of our sampled candidate rule and check the same consistencies again. We employ two types of unary Horn clauses to perform the closure of the available rules and to check the consistency of the different rules in the rule set. Using this process, we ensure that all generated rules are sound and consistent with respect to the rule set.
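The two constraints on candidate rules can be illustrated with a simplified sketch. It is not Algorithm 1 itself: it omits the invertible rules, the inverse-rule step and the closure-based consistency checks, all of which prune the rule set further, and the random visiting order and the function name are assumptions made for the example.

```python
import itertools
import random
from typing import Dict, List, Tuple

Rule = Tuple[Tuple[str, str], str]  # ((r_i, r_j), r_k) reads "r_i AND r_j => r_k"


def generate_rules(relations: List[str], seed: int = 0) -> List[Rule]:
    """Sample candidate rules subject to constraints (i) and (ii) only."""
    rng = random.Random(seed)
    candidates = list(itertools.product(relations, repeat=3))
    rng.shuffle(candidates)  # visit the candidate triples in random order

    head_for_body: Dict[Tuple[str, str], str] = {}
    for r_i, r_j, r_k in candidates:
        if r_k in (r_i, r_j):
            continue  # constraint (ii): the head must not appear in the body
        if (r_i, r_j) in head_for_body:
            continue  # constraint (i): at most one rule per body
        head_for_body[(r_i, r_j)] = r_k
    return [(body, head) for body, head in head_for_body.items()]
```

Storing the accepted rules in a dictionary keyed by the body makes constraint (i) hold by construction.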

World Sampling. From the set of rules, we partition the rules into buckets for different worlds (Algorithm 2). We use a simple policy of bucketing via a sliding window of width $w$ with stride $s$ to assign rules to each world. For example, with the rules indexed in order, two consecutive worlds are generated from the rules at positions $i, \dots, i+w-1$ and $i+s, \dots, i+s+w-1$, respectively. We randomly permute the rule set before bucketing in order.
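The sliding-window bucketing can be sketched in a few lines. The function name is hypothetical, and the width of 20 and stride of 1 used below are assumptions: they are chosen because, without wrap-around, they are the values that turn the 76 rules reported in Appendix E into the 57 worlds reported there.

```python
from typing import List, Sequence


def partition_rules(rules: Sequence[str], width: int, stride: int) -> List[List[str]]:
    """Split an ordered rule list into overlapping windows, one per world."""
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    return [
        list(rules[start:start + width])
        for start in range(0, len(rules) - width + 1, stride)
    ]


rules = [f"R{i}" for i in range(76)]                  # 76 rules, as in Appendix E
worlds = partition_rules(rules, width=20, stride=1)   # width and stride are assumed
print(len(worlds))                                    # 57
print(len(set(worlds[0]) & set(worlds[1])))           # 19 rules shared by neighbours
```

Worlds that are close in the window order share most of their rules, while worlds that are at least one window width apart share none, which is what allows the similarity between worlds to be controlled.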

Graph Generation. This is a two-step process where we first sample a world graph (Algorithm 3) and then sample individual graphs from the world graph (Algorithm 4). Given a set of rules, in the first step we recursively sample and apply rules from this set to generate a relation graph called the world graph. This sampling procedure enables us to create a diverse set of world graphs by considering only certain subsets of the full rule set during sampling. By controlling the extent of overlap between these subsets (in terms of the number of rules that are common across the subsets), we can precisely control the similarity between the different world graphs.

In the second step (Algorithm 4), the world graph is used to sample a set of graphs. A graph is sampled from the world graph by sampling a pair of nodes (a source and a sink) from it and then sampling a resolution path between them. The edge between the source and the sink provides the target relation that the learning model has to predict. Since the relation for this edge can be resolved by composing the relations along the resolution path, the relation prediction task tests the compositional generalization abilities of the models. We first sample all possible resolution paths and get their individual descriptors, which we split into training, validation and test splits. We then construct the training, validation and testing graphs by first adding all edges of an individual resolution path to the corresponding graph, and then sampling neighbors of the nodes on that path. Concretely, we use Breadth First Search (BFS) to sample the neighboring subgraph of each node on the path with a decaying selection probability. This allows us to create diverse input graphs while retaining precise control, via the descriptor, over how each graph is resolved. Splitting the dataset over these descriptor paths ensures inductive generalization.
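The neighbor sampling step can be sketched as a breadth-first traversal in which the probability of keeping an edge shrinks with depth. The decay schedule is not specified in the text, so the geometric schedule below, the depth cut-off, and all parameter names are assumptions made for illustration.

```python
import random
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, relation, target node)


def sample_neighbourhood(
    adjacency: Dict[str, List[Tuple[str, str]]],  # node -> [(relation, neighbour)]
    start: str,
    p0: float = 1.0,      # assumed selection probability at depth 0
    decay: float = 0.5,   # assumed multiplicative decay per BFS level
    max_depth: int = 3,   # assumed cut-off on the traversal depth
    seed: int = 0,
) -> Set[Edge]:
    """Breadth-first walk from `start`, keeping each outgoing edge found at
    a given depth with probability p0 * decay**depth."""
    rng = random.Random(seed)
    kept: Set[Edge] = set()
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        keep_prob = p0 * decay ** depth
        for relation, neighbour in adjacency.get(node, []):
            if rng.random() < keep_prob:
                kept.add((node, relation, neighbour))
                # Only expand through edges that were actually kept.
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return kept
```

In the procedure described above, such a routine would be called once for each node on the resolution path, and the returned edges added to the graph alongside the path edges.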

  Input: Set of relations
  Define an empty rule set
  Populate Invertible Rules, , add to
  for all  do
     for all  do
        for all  do
           Define candidate rule
           if Cyclical rule, i.e. OR  then
              Reject rule
           end if
           if  then
              Add to
              Define inverse rule
              if  then
                 Add to
                 Remove rule having body from
              end if
           end if
        end for
     end for
  end for
  Check and remove any further cyclical rules.
Algorithm 1 Rule Generator
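  Input: Rule Set
  Input: Number of worlds
  Input: Number of rules per world
  Input: Overlapping increment stride
  for each world index i = 0, 1, ..., (number of worlds − 1) do
     Assign to world i the window of rules that starts at offset i × stride and contains (number of rules per world) rules
  end for
Algorithm 2 Partition rules into overlapping sets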
  Input: Set of relations
  Input: Set of rules derived from the set of relations
  Input: Rule selection probability gamma
  Set rule selection probability
  Maximum number of expansions
  Set of available nodes , s.t.
  Number of cycles of generation
  Set WorldGraph set of edges
  while  or  do
     Randomly choose an expansion number for this cycle:
     Set added edges for this cycle
     for all step in steps do
        if step = 0 then
           With uniform probability, either:
           Sample from and sample without replacement, OR
           Sample an edge from
           Add to and
           Sample an edge from
        end if
        Sample a rule from following s.t.
        Sample a new node without replacement
        Add edge to and
        Add edge t and
     end for
     if all rules in the rule set have been used at least once then
        Increment by 1
        Reset rule selection probability
     end if
  end while
Algorithm 3 World Graph Generator
  Input: Rule Set
  Input: World Graph
  Input: Maximum Expansion length
  Set Descriptor set
  for all  do
     Get all walks such that
     Get all descriptors for all walks
     Add to
  end for
  Set train graph set
  Set test graph set
  Split descriptors in train and test split, and
  for all  or  do
     Set source node and sink node
     Set prediction target
     Set graph edges
     Add all edges from to
     for all  do
        Sample Breadth First Search connected nodes from and with decaying probability
        Add the sampled edges to
     end for
     Remove edges in which create shorter paths between and
     Add to either or
  end for
Algorithm 4 Graph Sampler

A.3 Computing Similarity

GraphLog provides precise control for categorizing the similarity between different worlds by computing the overlap of the underlying rules. Concretely, the similarity between two worlds is defined in terms of the overlap between the sets of rules associated with them. Thus, GraphLog enables various training scenarios, such as training on highly similar worlds or training on a mix of similar and dissimilar worlds. This fine-grained control allows GraphLog to mimic both in-distribution and out-of-distribution scenarios during training and testing. It also enables us to precisely categorize the effect of multi-task pre-training when the model needs to adapt to novel worlds.
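A rule-overlap similarity of this kind can be sketched as follows. The normalization by the size of the smaller rule set is an assumption, chosen so that the score ranges from 0.0 (no shared rules) to 1.0 (complete overlap), in line with the 0% and 100% similarity groups of Figure 8; the function name is hypothetical.

```python
from typing import Iterable


def world_similarity(rules_a: Iterable[str], rules_b: Iterable[str]) -> float:
    """Fraction of rules shared by two worlds (assumed normalisation)."""
    a, b = set(rules_a), set(rules_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


print(world_similarity(["R1", "R2", "R3", "R4"], ["R3", "R4", "R5", "R6"]))  # 0.5
```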

A.4 Computing Difficulty

Recent research in multitask learning has shown evidence that prioritizing difficult tasks over easy tasks during learning can boost overall performance (Guo et al., 2018). Thus, GraphLog also provides a method to examine how pre-training on tasks of different difficulty levels affects the adaptation performance. Due to the stochastic effect of partitioning the rules, GraphLog consists of datasets with a varying range of difficulty. We use the supervised learning scores (Table 6) as a proxy to determine the relative difficulty of the different datasets. We cluster the datasets such that tasks with prediction accuracy greater than or equal to 70% are labeled as easy, tasks with accuracy of 50-70% are labeled as medium, and tasks with accuracy below 50% are labeled as hard. We find that the labels obtained by this criterion are consistent across the different models (Figure 3).
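The bucketing rule can be written as a small function. This is a sketch under the thresholds stated in the text; the cut-offs are exposed as parameters because Table 6 labels a few worlds with accuracy slightly above 0.50 as Hard (for example, rule_5 at 0.526), which suggests that the lower cut-off actually used may differ slightly. The function name is hypothetical.

```python
def difficulty_label(accuracy: float, easy_min: float = 0.70, hard_max: float = 0.50) -> str:
    """Map a supervised accuracy score to a difficulty label."""
    if accuracy >= easy_min:
        return "Easy"
    if accuracy >= hard_max:
        return "Medium"
    return "Hard"


# M1 scores of three worlds from Table 6.
for world, score in [("rule_9", 0.758), ("rule_4", 0.624), ("rule_1", 0.432)]:
    print(world, difficulty_label(score))
# rule_9 Easy
# rule_4 Medium
# rule_1 Hard
```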

Appendix B Supervised learning on GraphLog

We perform extensive experiments over all the datasets available in GraphLog (statistics given in Table 6). We observe that, in general, for the entire set of 57 worlds, the GAT-E-GAT model performs the best. We observe that the relative difficulty (Section A.4) of the tasks is highly correlated with the number of descriptors (Section A.1) available for each task. This shows that, for a learner, a dataset with enough variety among the resolution paths of the graphs is relatively easier to learn than datasets that have less variation.

Appendix C Multitask Learning

C.1 Multitask Learning on different data splits by difficulty

| Representation | Composition | Easy (accuracy) | Medium (accuracy) | Difficult (accuracy) |
|---|---|---|---|---|
| GAT | E-GAT | 0.729 | 0.586 | 0.414 |
| Param | E-GAT | 0.728 | 0.574 | 0.379 |
| GCN | E-GAT | 0.713 | 0.55 | 0.396 |
| GAT | RGCN | 0.695 | 0.53 | 0.421 |
| Param | RGCN | 0.551 | 0.457 | 0.362 |
| GCN | RGCN | 0.673 | 0.514 | 0.396 |

Table 4: Inductive performance on data splits marked by difficulty

In Section A.4 we introduced the notion of difficulty among the tasks available in GraphLog. Here, we consider a set of experiments where we perform multitask training and inductive testing on the worlds bucketed by their relative difficulty (Table 4). We sample an equal number of worlds from each difficulty bucket, and perform multitask training and testing separately for each bucket. We evaluate the average prediction accuracy on the datasets within each bucket. We observe that the average multitask performance also mimics the relative task difficulty distribution. We find that the GAT-E-GAT model outperforms the other baselines in the Easy and Medium setups, but is outperformed by the GAT-RGCN model in the Difficult setup. For each model, we used the same architecture and hyperparameter settings across the buckets. Optimizing individually for each bucket may improve the relative performance.

C.2 Multitask Pre-training by task similarity

Figure 8: We perform a fine-grained analysis of few-shot adaptation capabilities in the multitask setting. Groups 0.0 and 1.0 correspond to 0% and 100% similarity, respectively.

In the main paper (Section 5.2) we introduce the setup of performing multitask pre-training on GraphLog datasets and adaptation on datasets chosen by their relative similarity. Here, we perform a fine-grained analysis of the few-shot adaptation capabilities of the models. We analyze the adaptation performance in two settings: when the adaptation dataset has complete overlap of rules with the training datasets (group=1.0) and when the adaptation dataset has zero overlap with the training datasets (group=0.0). We find that the RGCN family of models with a graph-based representation function adapts faster on the dissimilar dataset, with GCN-RGCN showing the fastest improvement. However, on the similar dataset the models follow the ranking of the supervised learning experiments, with the GAT-EGAT model adapting comparatively better.

C.3 Multitask Pre-training by task difficulty

| Representation | Composition | Easy (accuracy) | Medium (accuracy) | Difficult (accuracy) |
|---|---|---|---|---|
| GAT | E-GAT | 0.531 | 0.569 | 0.555 |
| Param | E-GAT | 0.520 | 0.548 | 0.540 |
| GCN | E-GAT | 0.555 | 0.561 | 0.558 |
| GAT | RGCN | 0.502 | 0.532 | 0.532 |
| Param | RGCN | 0.535 | 0.506 | 0.539 |
| GCN | RGCN | 0.481 | 0.516 | 0.520 |
| Mean | | 0.521 | 0.540 | 0.539 |

Table 5: Convergence performance on 3 held-out datasets when pre-trained on easy, medium and hard training datasets
Figure 9: We evaluate the effect of k-shot adaptation on held-out datasets when pre-trained on easy, medium and hard training datasets, among the different model architectures. Here, k ranges from 0 to 40.

Using the notion of difficulty introduced in Section A.4, we perform a suite of experiments to evaluate the effect of pre-training on Easy, Medium and Difficult datasets. Interestingly, we find that performance at convergence is better after pre-training on the Medium and Hard datasets than after pre-training on the Easy datasets (Table 5). This behaviour is mirrored in the k-shot adaptation performance (Figure 9), where pre-training on the Hard datasets provides faster adaptation for 4 of the 6 models.

Appendix D Continual Learning

A natural question arises following our continual learning experiments in Section 5.3: does the order of difficulty of the worlds matter? To answer it, we perform an experiment following the Curriculum Learning (Bengio et al., 2009) setup, where the order in which the worlds are trained is determined by their relative difficulty. Difficulty is determined by the performance of the models in the supervised learning setup (Table 6), and we order the worlds from easier to harder. We observe that while the current task accuracy follows the trend of the difficulty of the worlds (Figure 10), the mean of past accuracy is significantly worse. This suggests that a curriculum learning strategy might not be optimal for learning graph representations in a continual learning setting. We also performed the same experiment while sharing only the composition function or only the representation function (Figure 11), and observe similar trends, where sharing the representation function reduces the effect of catastrophic forgetting.
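The curriculum ordering amounts to sorting the worlds by their supervised accuracy, from the highest (easiest) to the lowest (hardest). A minimal sketch, assuming the M1 column of Table 6 is used as the difficulty proxy and using a hypothetical function name:

```python
from typing import Dict, List


def curriculum_order(supervised_accuracy: Dict[str, float]) -> List[str]:
    """Order worlds from easiest (highest accuracy) to hardest (lowest)."""
    return sorted(supervised_accuracy, key=lambda w: supervised_accuracy[w], reverse=True)


# M1 scores of three worlds from Table 6.
scores = {"rule_0": 0.481, "rule_9": 0.758, "rule_4": 0.624}
print(curriculum_order(scores))  # ['rule_9', 'rule_4', 'rule_0']
```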

Figure 10: Curriculum Learning strategy in Continual Learning setup of GraphLog.
Figure 11: Curriculum Learning strategy in Continual Learning setup of GraphLog, when either the composition function or the representation function is shared for all worlds.
World ID NC ND Split ARL AN AE D M1 M2 M3 M4 M5 M6
rule_0 17 286 train 4.49 15.487 19.295 Hard 0.481 0.500 0.494 0.486 0.462 0.462
rule_1 15 239 train 4.10 11.565 13.615 Hard 0.432 0.411 0.428 0.406 0.400 0.408
rule_2 17 157 train 3.21 9.809 11.165 Hard 0.412 0.357 0.373 0.347 0.347 0.319
rule_3 16 189 train 3.63 11.137 13.273 Hard 0.429 0.404 0.473 0.373 0.401 0.451
rule_4 16 189 train 3.94 12.622 15.501 Medium 0.624 0.606 0.619 0.475 0.481 0.595
rule_5 14 275 train 4.41 14.545 18.872 Hard 0.526 0.539 0.548 0.429 0.461 0.455
rule_6 16 249 train 5.06 16.257 20.164 Hard 0.528 0.514 0.536 0.498 0.495 0.476
rule_7 17 288 train 4.47 13.161 16.333 Medium 0.613 0.558 0.598 0.487 0.486 0.537
rule_8 15 404 train 5.43 15.997 19.134 Medium 0.627 0.643 0.629 0.523 0.563 0.569
rule_9 19 1011 train 7.22 24.151 32.668 Easy 0.758 0.744 0.739 0.683 0.651 0.623
rule_10 18 524 train 5.87 18.011 22.202 Medium 0.656 0.654 0.663 0.596 0.563 0.605
rule_11 17 194 train 4.29 11.459 13.037 Medium 0.552 0.525 0.533 0.445 0.456 0.419
rule_12 15 306 train 4.14 11.238 12.919 Easy 0.771 0.726 0.603 0.511 0.561 0.523
rule_13 16 149 train 3.58 11.238 13.549 Hard 0.453 0.402 0.419 0.347 0.298 0.344
rule_14 16 224 train 4.14 11.371 13.403 Hard 0.448 0.457 0.401 0.314 0.318 0.332
rule_15 14 224 train 3.82 12.661 15.105 Hard 0.494 0.423 0.501 0.402 0.397 0.435
rule_16 16 205 train 3.59 11.345 13.293 Hard 0.318 0.332 0.292 0.328 0.306 0.291
rule_17 17 147 train 3.16 8.163 8.894 Hard 0.347 0.308 0.274 0.164 0.161 0.181
rule_18 18 923 train 6.63 25.035 33.080 Easy 0.700 0.680 0.713 0.650 0.641 0.618
rule_19 16 416 train 6.10 17.180 20.818 Easy 0.790 0.774 0.777 0.731 0.729 0.702
rule_20 20 2024 train 8.63 34.059 45.985 Easy 0.830 0.799 0.854 0.756 0.741 0.750
rule_21 13 272 train 4.58 10.559 11.754 Medium 0.621 0.610 0.632 0.531 0.516 0.580
rule_22 17 422 train 5.21 16.540 20.681 Medium 0.586 0.593 0.628 0.530 0.506 0.573
rule_23 15 383 train 4.97 17.067 21.111 Hard 0.508 0.522 0.493 0.455 0.473 0.476
rule_24 18 879 train 6.33 21.402 26.152 Easy 0.706 0.704 0.743 0.656 0.641 0.638
rule_25 15 278 train 3.84 11.093 12.775 Hard 0.424 0.419 0.382 0.358 0.345 0.412
rule_26 15 352 train 4.71 14.157 17.115 Medium 0.565 0.534 0.532 0.466 0.461 0.499
rule_27 16 393 train 4.98 14.296 16.499 Easy 0.713 0.714 0.722 0.632 0.604 0.647
rule_28 16 391 train 4.82 17.551 21.897 Medium 0.575 0.564 0.571 0.503 0.499 0.552
rule_29 16 144 train 3.87 10.193 11.774 Hard 0.468 0.445 0.475 0.325 0.336 0.389
rule_30 17 177 train 3.51 10.270 11.764 Hard 0.381 0.426 0.382 0.357 0.316 0.336
rule_31 19 916 train 5.90 20.147 26.562 Easy 0.788 0.789 0.770 0.669 0.674 0.641
rule_32 16 287 train 4.66 16.270 20.929 Medium 0.674 0.671 0.700 0.621 0.594 0.615
rule_33 18 312 train 4.50 14.738 18.266 Medium 0.695 0.660 0.709 0.710 0.679 0.668
rule_34 18 504 train 5.00 15.345 18.614 Easy 0.908 0.888 0.906 0.768 0.762 0.811
rule_35 19 979 train 6.23 21.867 28.266 Easy 0.831 0.750 0.782 0.680 0.700 0.662
rule_36 19 252 train 4.66 13.900 16.613 Easy 0.742 0.698 0.698 0.659 0.627 0.651
rule_37 17 260 train 4.00 11.956 14.010 Easy 0.843 0.826 0.826 0.673 0.698 0.716
rule_38 17 568 train 5.21 15.305 20.075 Easy 0.748 0.762 0.733 0.644 0.630 0.719
rule_39 15 182 train 3.98 12.552 14.800 Easy 0.737 0.642 0.635 0.592 0.603 0.587
rule_40 17 181 train 3.69 11.556 14.437 Medium 0.552 0.584 0.575 0.525 0.472 0.479
rule_41 15 113 train 3.58 10.162 11.553 Medium 0.619 0.601 0.626 0.490 0.468 0.470
rule_42 14 95 train 2.96 8.939 9.751 Hard 0.511 0.472 0.483 0.386 0.393 0.395
rule_43 16 162 train 3.36 11.077 13.337 Medium 0.622 0.567 0.579 0.473 0.482 0.437
rule_44 18 705 train 4.75 15.310 18.172 Hard 0.538 0.561 0.603 0.498 0.519 0.450
rule_45 15 151 train 3.39 9.127 10.001 Medium 0.569 0.580 0.592 0.535 0.524 0.524
rule_46 19 2704 train 7.94 31.458 43.489 Easy 0.850 0.820 0.828 0.773 0.762 0.749
rule_47 18 647 train 6.66 22.139 27.789 Easy 0.723 0.667 0.708 0.620 0.649 0.611
rule_48 16 978 train 6.15 17.802 21.674 Easy 0.812 0.798 0.812 0.772 0.763 0.753
rule_49 14 169 train 3.41 9.983 11.177 Easy 0.714 0.734 0.700 0.511 0.491 0.615
rule_50 16 286 train 3.99 12.274 16.117 Medium 0.651 0.653 0.656 0.555 0.583 0.570
rule_51 16 332 valid 4.44 16.384 21.817 Easy 0.746 0.742 0.738 0.667 0.657 0.689
rule_52 17 351 valid 4.81 16.231 20.613 Medium 0.697 0.716 0.754 0.653 0.655 0.670
rule_53 15 165 valid 3.65 10.838 12.378 Hard 0.458 0.464 0.525 0.334 0.364 0.373
rule_54 13 303 test 5.25 13.503 15.567 Medium 0.638 0.623 0.603 0.587 0.586 0.555
rule_55 16 293 test 4.83 16.444 20.944 Medium 0.625 0.582 0.578 0.561 0.528 0.571
rule_56 15 241 test 4.40 14.010 16.702 Medium 0.653 0.681 0.692 0.522 0.513 0.550
AGG 16.33 428.94 4.70 14.89 18.37 0.618 / 26 0.603 / 10 0.611 / 20 0.530 / 1 0.526 / 0 0.539 / 0
Table 6: Results on the single-task supervised setup for all datasets in GraphLog. Abbreviations: NC: Number of Classes, ND: Number of Descriptors, ARL: Average Resolution Length, AN: Average number of nodes, AE: Average number of edges, D: Difficulty, AGG: Aggregate Statistics. List of models considered: M1: GAT-E-GAT, M2: GCN-E-GAT, M3: Param-E-GAT, M4: GAT-RGCN, M5: GCN-RGCN and M6: Param-RGCN. Difficulty is calculated by taking the scores of the model (M1) and partitioning the worlds into Easy, Medium and Hard according to their accuracy. In the AGG row, we provide both the mean of the raw accuracy scores for each model and the number of tasks on which that model is ranked first.
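The two aggregate statistics reported per model, the mean accuracy and the number of worlds on which the model ranks first, can be computed as in the sketch below. It uses only the first three rows of Table 6 as input; the function name and the handling of ties (every tied model is credited) are assumptions.

```python
from typing import Dict, List, Tuple


def aggregate(rows: Dict[str, List[float]], models: List[str]) -> Dict[str, Tuple[float, int]]:
    """Return {model: (mean accuracy, number of worlds where it ranks first)}."""
    wins = {m: 0 for m in models}
    totals = {m: 0.0 for m in models}
    for scores in rows.values():
        best = max(scores)
        for model, score in zip(models, scores):
            totals[model] += score
            if score == best:
                wins[model] += 1  # assumption: ties count for every tied model
    n = len(rows)
    return {m: (totals[m] / n, wins[m]) for m in models}


models = ["M1", "M2", "M3", "M4", "M5", "M6"]
rows = {  # first three rows of Table 6
    "rule_0": [0.481, 0.500, 0.494, 0.486, 0.462, 0.462],
    "rule_1": [0.432, 0.411, 0.428, 0.406, 0.400, 0.408],
    "rule_2": [0.412, 0.357, 0.373, 0.347, 0.347, 0.319],
}
stats = aggregate(rows, models)
print(stats["M1"][1], stats["M2"][1])  # 2 1
```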

Appendix E Hyperparameters and Experimental Setup

In this section, we provide detailed hyperparameter settings for both models and dataset generation for the purposes of reproducibility. The codebase and dataset used in the experiments are attached with the Supplementary materials, and will be made public on acceptance.

E.1 Dataset Hyperparams

We generate GraphLog with 20 relations or classes, which results in 76 rules in the rule set after consistency checks. For unary rules, we specify half of the relations to be symmetric and the other half to have their invertible relations. To split the rules for individual worlds, we choose the number of rules for each world and the stride, and end up with 57 worlds. For each world, we generate 5000 training, 1000 testing and 1000 validation graphs.

E.2 Model Hyperparams

For all models, we perform a hyper-parameter sweep (grid search) to find the optimal values based on the validation accuracy. For all models, we use 200-dimensional relation and node embeddings. We train all models with the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.0001. In the supervised setting, we train all models for 500 epochs, and we add a learning rate scheduler that decays the learning rate by a factor of 0.8 whenever the validation loss is stagnant for 10 epochs. In the multitask setting, we sample a new task every epoch from the list of available tasks. Here, we run all models for 2000 epochs when the number of tasks is at most 10. For larger numbers of tasks (Figure 4), we increase the number of epochs in proportion to the number of tasks (2k epochs for 10 tasks, 4k epochs for 20 tasks, 6k epochs for 30 tasks, 8k epochs for 40 tasks and 10k epochs for 50 tasks). For the continual learning experiments, we train each task for 100 epochs for all models. No learning rate scheduling is used for either the multitask or the continual learning experiments. Individual model hyper-parameters are as follows:

  • Representation functions :

    • GAT : Number of layers = 2, Number of attention heads = 2, Dropout = 0.4

    • GCN : Number of layers = 2, with symmetric normalization and bias, no dropout

  • Composition functions:

    • E-GAT: Number of layers = 6, Number of attention heads = 2, Dropout = 0.4

    • RGCN: Number of layers = 2, no dropout, with bias.
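The learning rate schedule and the epoch budget described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: the class and function names are hypothetical, "stagnant" is interpreted as no improvement over the best validation loss seen so far, and the 2000-epoch floor for small numbers of tasks is an assumption. In a PyTorch training loop, `torch.optim.lr_scheduler.ReduceLROnPlateau` offers similar behaviour, although its patience counting differs slightly.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` consecutive
    epochs without improvement in validation loss."""

    def __init__(self, lr: float = 0.001, factor: float = 0.8, patience: int = 10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


def multitask_epochs(num_tasks: int, base_epochs: int = 2000, base_tasks: int = 10) -> int:
    """Scale the epoch budget with the number of tasks (2k for 10, 4k for 20, ...),
    never going below the base budget (assumed for fewer than 10 tasks)."""
    return max(base_epochs, base_epochs * num_tasks // base_tasks)
```

For example, `multitask_epochs(30)` gives 6000, matching the 6k epochs listed for 30 tasks.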