Relational reasoning, or the ability to reason about the relationships between objects and entities in the environment, is considered a fundamental aspect of intelligence (Krawczyk et al., 2011; Halford et al., 2010). Relational reasoning is known to play a critical role in the cognitive growth of children (Son et al., 2011; Farrington-Flint et al., 2007; Richland et al., 2010). This ability to infer relations between objects, entities, and situations, and to compose relations into higher-order relations, is one of the reasons why humans quickly learn how to solve new tasks (Holyoak and Morrison, 2012; Alexander, 2016).
The perceived importance of relational reasoning for generalization capabilities has fueled the development of several neural network architectures that incorporate relational inductive biases (Battaglia et al., 2016; Santoro et al., 2017; Battaglia et al., 2018). Graph neural networks (GNNs), in particular, have emerged as a dominant computational paradigm within this growing area (Scarselli et al., 2008; Hamilton et al., 2017a; Gilmer et al., 2017; Schlichtkrull et al., 2018; Du et al., 2019). However, despite the growing interest in GNNs and their promise for improving the generalization capabilities of neural networks, we currently lack an understanding of how effectively these models can adapt and generalize across distinct tasks.
In this work, we study the task of logical generalization in the context of relational reasoning using GNNs. In particular, we study how GNNs can induce logical rules and generalize by combining these rules in novel ways after training. We propose a benchmark suite, GraphLog, that is grounded in first-order logic. Figure 1 shows the setup of the benchmark. Given a set of logical rules, we create different logical worlds with overlapping rules. For each world, we sample multiple knowledge graphs. The learning agent should learn to induce the logical rules for predicting the missing facts in these knowledge graphs. Using our benchmark, we evaluate the generalization capabilities of GNNs in a supervised setting by predicting unseen combinations of known rules within a specific logical world. This task explicitly requires inductive generalization. We further analyze how various GNN architectures perform in the multi-task and continual learning scenarios, where they have to learn over a set of logical worlds with different underlying logic. Our setup allows us to control the similarity between the different worlds by controlling the overlap in their logical rules. This enables us to precisely analyze how task similarity impacts performance in the multi-task setting.
Our analysis provides the following useful insights regarding the logical generalization capabilities of GNNs:
Two architecture choices for GNNs have a strong positive impact on the generalization performance: 1) incorporating multi-relational edge features using attention, and 2) explicitly modularising the GNN architecture to include a parametric representation function, which learns representations for the relations based on the knowledge graph structure.
In the multi-task setting, training a model on a more diverse set of logical worlds improves generalization and adaptation performance.
All the evaluated models exhibit catastrophic forgetting in the continual learning setting. This indicates that the models are prone to fitting just the current task at hand rather than learning representations and compositions that can transfer across tasks, which highlights the challenge of lifelong learning in the context of logical generalization and GNNs.
2 Background and Related Work
Graph Neural Networks. Several graph neural network (GNN) architectures have been proposed to learn the representation for the graph input (Scarselli et al., 2008; Duvenaud et al., 2015; Defferrard et al., 2016; Kipf and Welling, 2016; Gilmer et al., 2017; Veličković et al., 2017; Hamilton et al., 2017b; Schlichtkrull et al., 2018). Previous works have focused on evaluating graph neural networks in terms of their expressive power (Morris et al., 2019; Xu et al., 2018), usefulness of features (Chen et al., 2019), and explaining the predictions from GNNs (Ying et al., 2019). Complementing these works, we evaluate GNN models on the task of logical generalization.
Knowledge graph completion. Many knowledge graph datasets are available for the task of relation prediction (also known as knowledge base completion). Prominent examples include Freebase15K (Bordes et al., 2013), WordNet (Miller, 1995), NELL (Mitchell and Fredkin, 2014), and YAGO (Suchanek et al., 2007; Hoffart et al., 2011; Mahdisoltani et al., 2013). These datasets are derived from real-world knowledge graphs and are useful for empirical evaluation of relation prediction systems. However, these datasets are generally noisy and incomplete, as many facts are not available in the underlying knowledge bases (West et al., 2014; Paulheim, 2017). Moreover, the logical rules underpinning these systems are often opaque and implicit (Guo et al., 2016). All these shortcomings reduce the usefulness of existing knowledge graph datasets for understanding the logical generalization capability of neural networks. Some of these limitations can be overcome by using synthetic datasets, which can provide a high degree of control and flexibility over the data generation process at a low cost. Synthetic datasets are useful for understanding the behavior of different models, especially when the underlying problem has many factors of variation. We consider synthetic datasets as a means, not an end, for understanding the logical generalization capability of GNNs.
Our GraphLog benchmark serves as a synthetic complement to the real-world datasets. Instead of sampling from a real-world knowledge base, we create synthetic knowledge graphs that are governed by a known and inspectable set of logical rules. Moreover, the relations in GraphLog do not require any common-sense knowledge, which makes the tasks self-contained.
Procedurally generated datasets for reasoning. In recent years, several procedurally generated benchmarks have been proposed to study the relational reasoning and compositional generalization properties of neural networks. Some recent and prominent examples are listed in Table 1. These datasets aim to provide a controlled testbed for evaluating the compositional reasoning capabilities of neural networks in isolation. Based on these existing works and their insightful observations, we enumerate the four key desiderata that, we believe, such a benchmark should provide:
Interpretable Rules: The rules that are used to procedurally generate the dataset should be human interpretable.
Diversity: The benchmark datasets should have enough diversity across different tasks, and the compositional rules used to solve different tasks should be distinct, so that adaptation on a novel task is not trivial. The degree of similarity across the tasks should be configurable to enable evaluating the role of diversity in generalization.
Compositional generalization: The benchmark should require compositional generalization, i.e., generalization to unseen combinations of rules.
Number of tasks: The benchmark should support creating a large number of tasks. This enables a more fine-grained inspection of the generalization capabilities of the model in different setups, e.g., supervised learning, multitask learning, and continual learning.
As shown in Table 1, GraphLog is unique in satisfying all of these desiderata. We highlight that GraphLog is the only dataset specifically designed to test logical generalization capabilities on graph data, whereas previous works have largely focused on the image and text modalities.
| CLEVR (Johnson et al., 2017) | ✓ | ✗ | ✗ | Vision | ✓ | ✗ | ✗ | ✗ |
| CoGenT (Johnson et al., 2017) | ✓ | ✗ | ✓ | Vision | ✓ | ✗ | ✗ | ✗ |
| CLUTRR (Sinha et al., 2019) | ✓ | ✗ | ✓ | Text | ✓ | ✗ | ✗ | ✗ |
| SCAN (Lake and Baroni, 2017) | ✓ | ✗ | ✓ | Text | ✓ | ✓ | ✗ | ✗ |
| SQoOP (Bahdanau et al., 2018) | ✓ | ✗ | ✓ | Vision | ✓ | ✗ | ✗ | ✗ |
| TextWorld (Côté et al., 2018) | ✗ | ✓ | ✓ | Text | ✓ | ✓ | ✓ | ✓ |
A graph $G = (V_G, E_G)$ is a collection of a set of nodes $V_G$ and a set of edges $E_G$ between the nodes. In this work, we assume that each pair of nodes has at most one edge between them. A relational graph is a graph where the edge between two nodes (say $u$ and $v$) is assigned a label, denoted $r$. The labeled edge is denoted as $(u \xrightarrow{r} v) \in E_G$. A relation set $R$ is a set of relations $\{r_1, r_2, \ldots, r_K\}$. A rule set $\mathcal{R}$ is a set of rules in first-order logic, which we denote in the Datalog format (Evans and Grefenstette, 2017), $[r_i, r_j] \implies r_k$, and which can be expanded as Horn clauses of the form:

$$\exists z \in V_G : (u \xrightarrow{r_i} z) \land (z \xrightarrow{r_j} v) \implies (u \xrightarrow{r_k} v) \qquad (1)$$

where $z$ denotes a variable that can be bound to any entity and $\implies$ denotes logical implication. The relations $r_i, r_j$ form the body while the relation $r_k$ forms the head of the rule. Horn clauses of this form represent a well-defined subset of first-order logic, and they encompass the types of logical rules learned by the vast majority of existing rule induction engines for knowledge graphs (Langley and Simon, 1995).
We use $p^G_{u,v}$ to denote a path from node $u$ to node $v$ in a graph $G$. We construct graphs according to rules of the form in Equation 1 so that a path between two nodes will always imply a specific relation between these two nodes. In other words, we will always have that

$$\exists r_i \in R : p^G_{u,v} \implies (u \xrightarrow{r_i} v). \qquad (2)$$

Thus, by following the path between two nodes and applying the propositional rules along the edges of the path, we can resolve the relationship between the nodes. Hence, we refer to these paths as resolution paths. The edges of the resolution path are concatenated together to obtain a descriptor. These descriptors are used for quantifying the similarity between different resolution paths, with a higher overlap between the descriptors implying a greater similarity between two resolution paths.
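To make the idea of resolving a path concrete, the following is a minimal sketch, not the benchmark's own code. It assumes rules with two-relation bodies stored as a dictionary from body to head, uses made-up relation names (`r1` to `r6`), and tries every way of bracketing the path with a small dynamic program. The `descriptor_overlap` function is one assumed way to measure overlap between descriptors (Jaccard overlap of adjacent relation pairs); the text does not specify the exact measure.

```python
from itertools import product

# Hypothetical toy rule set, written as body -> head. Relation names are illustrative.
RULES = {
    ("r1", "r2"): "r3",
    ("r3", "r4"): "r5",
    ("r2", "r4"): "r6",
    ("r1", "r6"): "r5",
}


def resolve(path, rules):
    """Return every relation derivable for a non-empty path of relation labels.

    table[i][j] holds the relations that can label the sub-path path[i..j];
    longer spans are built by applying a rule to two adjacent shorter spans.
    """
    n = len(path)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, relation in enumerate(path):
        table[i][i].add(relation)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left, right in product(table[i][k], table[k + 1][j]):
                    head = rules.get((left, right))
                    if head is not None:
                        table[i][j].add(head)
    return table[0][n - 1]


def descriptor(path):
    """Concatenate the edge labels of a resolution path into a descriptor string."""
    return ",".join(path)


def descriptor_overlap(path_a, path_b):
    """Jaccard overlap between the adjacent relation pairs of two paths (assumed measure)."""
    pairs_a = set(zip(path_a, path_a[1:]))
    pairs_b = set(zip(path_b, path_b[1:]))
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0
```

With this toy rule set, the path `r1, r2, r4` resolves to `r5` under either bracketing, while a path whose adjacent relations match no rule body resolves to the empty set.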
3.2 Problem Setup
We formulate the relational reasoning task as predicting relations between the nodes in a relational graph. Given a query $(g, u, v)$, consisting of a graph $g$ and two of its nodes $u$ and $v$, the learner has to predict the relation for the edge between $u$ and $v$. Unlike previous work on knowledge graph completion, we emphasize an inductive problem setup, where the graph in each query is unique. Rather than reasoning on a single static knowledge graph during training and testing, we consider the setting where the model must learn to generalize to unseen graphs during evaluation.
3.3 Dataset Generation
As discussed in Section 2, we want our proposed benchmark to provide four key desiderata: (i) interpretable rules, (ii) diversity, (iii) compositional generalization and (iv) large number of tasks. We describe how our dataset generation process ensures all four aspects.
Rule generation. We create a set of relations $R$ and use it to sample a rule set $\mathcal{R}$. We impose two constraints on $\mathcal{R}$: (i) No two rules in $\mathcal{R}$ can have the same body. This ensures consistency between the rules. (ii) Rules cannot have common relations among the head and body. This ensures the absence of cyclic dependencies in rules (Hamilton et al., 2018). Generating the dataset using a consistent and well-defined rule set ensures interpretability in the resulting dataset. The full algorithm for rule generation is given in the Appendix (Algorithm 1).
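The two constraints can be illustrated with a short sketch. This is a simplified stand-in rather than the Appendix algorithm, which is not reproduced here. It assumes two-relation bodies, and the function name `sample_rule_set` and its parameters are hypothetical.

```python
import itertools
import random


def sample_rule_set(relations, num_rules, seed=0):
    """Sample up to num_rules rules of the form (r_i, r_j) -> r_k.

    Constraint (i), unique bodies, holds because bodies are dictionary keys.
    Constraint (ii), head not in body, is enforced when the head is drawn.
    """
    rng = random.Random(seed)
    bodies = list(itertools.product(relations, repeat=2))
    rng.shuffle(bodies)
    rules = {}
    for body in bodies:
        if len(rules) == num_rules:
            break
        # Only relations that do not occur in the body may serve as the head.
        candidates = [r for r in relations if r not in body]
        if not candidates:
            continue
        rules[body] = rng.choice(candidates)
    return rules
```

For example, `sample_rule_set(["r1", "r2", "r3", "r4"], num_rules=8)` returns eight rules whose bodies are pairwise distinct and whose heads never appear in their own bodies.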
Graph generation. The graph generation process has two steps. In the first step, we recursively sample and use rules in the rule set $\mathcal{R}$ to generate a relational graph called the WorldGraph (as shown in Figure 1). This sampling procedure enables us to create a diverse set of WorldGraphs by considering only certain subsets of $\mathcal{R}$ during sampling. By controlling the extent of overlap between these subsets (in terms of the number of rules that are common across the subsets), we can precisely control the similarity between the different WorldGraphs. The full algorithm for generating the WorldGraph and controlling the similarity between the worlds is given in the Appendix (Algorithm 3 and Section A.2).
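As a rough illustration of how rule overlap can be controlled, the sketch below draws two rule subsets of equal size that share an exact number of rules. It is an assumption-laden simplification, not the Appendix procedure: the function name, the equal-size restriction, and the restriction to two worlds are all hypothetical choices.

```python
import random


def overlapping_rule_subsets(rules, subset_size, num_shared, seed=0):
    """Return two rule subsets of size subset_size sharing exactly num_shared rules."""
    unique = subset_size - num_shared
    needed = num_shared + 2 * unique
    if not 0 <= num_shared <= subset_size or needed > len(rules):
        raise ValueError("not enough rules for the requested sizes")
    rng = random.Random(seed)
    # Draw distinct rules once, then split them into shared and world-specific parts.
    pool = rng.sample(sorted(rules), needed)
    shared = pool[:num_shared]
    first = set(shared + pool[num_shared:num_shared + unique])
    second = set(shared + pool[num_shared + unique:])
    return first, second
```

With 76 rule identifiers, a subset size of 20, and 15 shared rules, the two resulting worlds have 75% of their rules in common.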
In the second step, the WorldGraph $G_W$ is used to sample a set of graphs (shown as Step (a) in Figure 2). A graph $g$ is sampled from $G_W$ by sampling a pair of nodes $(u, v)$ from $G_W$ and then sampling a resolution path $p_{u,v}$ between them. The edge between the source and sink nodes of the path provides the target relation for the learning model to predict. To increase the complexity of the sampled graphs (beyond being simple paths), we also add nodes to $g$ by sampling neighbors of the nodes on $p_{u,v}$, such that no other shortest path exists between $u$ and $v$. Algorithm 4 (in the Appendix) details our graph sampling approach.
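A simplified sketch of this sampling step is given below. It is not Algorithm 4. It assumes the WorldGraph is stored as a dictionary from directed edges to relation labels, picks a target edge, and searches for a simple path of at least two edges that avoids the target edge itself. The neighbor-sampling step that adds extra nodes is deliberately left out, and all names are hypothetical.

```python
import random


def sample_query_graph(world, seed=0, max_len=10):
    """Sample a resolution path and its target relation from a WorldGraph.

    world maps directed edges (u, v) to relation labels. Returns None if no
    target edge has an alternative path of at most max_len edges.
    """
    rng = random.Random(seed)
    successors = {}
    for source, sink in world:
        successors.setdefault(source, []).append(sink)
    targets = sorted(world)
    rng.shuffle(targets)
    for source, sink in targets:
        path = _random_path(successors, source, sink, rng, max_len)
        if path is not None:
            edges = {(a, b): world[(a, b)] for a, b in zip(path, path[1:])}
            return {"edges": edges, "query": (source, sink), "target": world[(source, sink)]}
    return None


def _random_path(successors, source, sink, rng, max_len):
    """Randomized depth-first search for a simple path that skips the direct edge."""

    def visit(path):
        node = path[-1]
        if node == sink:
            return path
        if len(path) - 1 == max_len:
            return None
        options = [n for n in successors.get(node, []) if n not in path]
        rng.shuffle(options)
        for nxt in options:
            # The direct source -> sink edge is the prediction target, so it is excluded.
            if node == source and nxt == sink:
                continue
            found = visit(path + [nxt])
            if found is not None:
                return found
        return None

    return visit([source])
```

Because the direct edge is skipped, every sampled path has at least two edges, and `max_len` caps its length.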
3.4 Summary of the GraphLog Dataset
We use the data generation process described in Section 3.3 to instantiate a dataset suite with 57 distinct logical worlds and graphs per world (Figure 1). The dataset is divided into the sets of training, validation, and testing worlds. The graphs within each world are also split into training, validation, and testing sets. The key statistics of the datasets are given in Table 2. Though we instantiate 57 worlds, the GraphLog code can instantiate an arbitrary number of worlds and has been included in the supplementary material.
3.4.1 Setups supported in GraphLog
GraphLog enables us to investigate the logical relational reasoning performance of models in the following setups:
Table 2: Key statistics of the GraphLog dataset.
| Statistic | Value |
|---|---|
| Number of relations | 20 |
| Total number of WorldGraphs | 57 |
| Total number of unique rules | 76 |
| Training graphs per WorldGraph | 5000 |
| Validation graphs per WorldGraph | 1000 |
| Testing graphs per WorldGraph | 1000 |
| Number of rules per WorldGraph | 20 |
| Average number of descriptors | 522 |
| Maximum length of resolution path | 10 |
| Minimum length of resolution path | 2 |
Supervised learning. In the supervised learning setup, a model is trained on the train split of a single logical world and evaluated on the test split of the same world. The total number of rules grows exponentially with the number of relations, making it impossible to train on all possible combinations of the relations. However, we expect that a perfectly systematic model generalizes to unseen combinations of relations by training only on a subset of combinations (i.e., via inductive reasoning).
Multi-task learning. GraphLog provides multiple logical worlds, each with its own training and evaluation splits. In the standard multi-task training, the model is trained on the train split of many worlds and evaluated on the test split of the same worlds. The complexity of each world and the similarity between the different worlds can be precisely controlled. GraphLog thus enables us to evaluate how model performance varies when the model is trained on similar vs. dissimilar worlds.
GraphLog is also designed to study the effect of pre-training on adaptation. In this setup, the model is first pre-trained on the train split of multiple worlds and then adapted (fine-tuned) on the train split of unseen held-out worlds. The model is evaluated on the test split of the novel worlds. As in the previous setup, GraphLog lets us investigate the effect of similarity in pre-training. This enables GraphLog to mimic in-distribution and out-of-distribution training and testing scenarios, as well as to precisely characterize the effect of multi-task pre-training on adaptation performance.
Continual learning. GraphLog provides access to a large number of worlds, enabling us to evaluate the logical generalization capability of the models in the continual learning setup. In this setup, the model is trained on a sequence of worlds. Before training on a new world, the model is evaluated on all the worlds that it has been trained on so far. Given the several challenges involved in continual learning (Thrun and Pratt, 2012; Parisi et al., 2019; De Lange et al., 2019; Sodhani et al., 2019), we do not expect the models to be able to remember the knowledge from all the previous tasks. Nonetheless, given that we are evaluating the models for relational reasoning and that our datasets share relations, we would expect the models to retain some knowledge of how to solve the previous tasks. In this sense, the performance on the previous tasks can also be seen as an indicator of whether the models actually learn to solve the relational reasoning tasks or merely fit the current dataset distribution.
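The continual learning protocol can be sketched as a short evaluation loop. The sketch is illustrative only. `train_fn` and `eval_fn` are hypothetical callables standing in for a real training and evaluation routine, and worlds are assumed to be identified by hashable names. Evaluation is run after training on each world, which matches evaluating before the next world begins.

```python
def continual_evaluation(worlds, train_fn, eval_fn, model):
    """Train on worlds in order and score every world seen so far after each one.

    Returns a list with one dictionary per training stage, mapping each
    world seen so far to the score returned by eval_fn.
    """
    history = []
    seen = []
    for world in worlds:
        model = train_fn(model, world)
        seen.append(world)
        history.append({w: eval_fn(model, w) for w in seen})
    return history


def average_on_previous(scores, current):
    """Mean score over all worlds except the one just trained on (None if there are none)."""
    previous = [score for world, score in scores.items() if world != current]
    return sum(previous) / len(previous) if previous else None
```

A learner that only remembers the most recent world would score 1.0 on the current world and 0.0 on all earlier ones, so `average_on_previous` drops to 0.0 as soon as a second world is trained. This is the pattern described as catastrophic forgetting.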
4 Representation and Composition
In this section, we describe the graph neural network (GNN) architectures that we evaluate on the GraphLog benchmark. In order to perform well on the benchmark tasks, a model should learn representations that are useful for solving the tasks in the current world while being general enough to be effectively adapted to the new worlds. To this end, we structure the GNN models we analyze around two key modules:
Representation module: This module is represented as a function $f_r$, which maps logical relations within a particular world to $d$-dimensional vector representations. Intuitively, this function should learn how to encode the semantics of the various relations within a logical world.
Composition module: This module is a function $f_c$, which learns how to compose the relation representations learned by $f_r$ to make predictions about queries over a knowledge graph.
Note that though we break down the process into two steps, in practice, the learner does not have access to the correct representations of relations or to the underlying rule set. The learner has to rely only on the target labels to solve the reasoning task. We hypothesize that this separation of concerns between a representation function and a composition function (Dijkstra, 1982) could provide a useful inductive bias for the model.
4.1 Representation modules
We first describe the different approaches for learning the representation for the relations. These representations will be provided as input to the composition function.
Direct parameterization. The simplest approach to defining the representation module is to train a unique embedding for each relation. This approach is predominantly used in previous work on GNNs (Gilmer et al., 2017; Veličković et al., 2017), and we refer to it as the Param representation module. A major limitation of this approach is that the relation representations are optimized specifically for each logical world, and there is no inductive bias towards learning representations that can generalize.
Learning representations from the graph structure. In order to define a more powerful and expressive representation function, we consider an approach that learns relation representations as a function of the WorldGraph $G_W$ underlying a logical world. To do so, we consider an “extended” form of the WorldGraph, $\tilde{G}_W$, where we introduce new nodes (called edge-nodes) corresponding to each edge in the original WorldGraph $G_W$. For an edge $(u \xrightarrow{r} v)$, the corresponding edge-node is connected to only those nodes that were incident to it in the original graph (i.e., nodes $u$ and $v$; see Figure 2, Step (b)). This new graph has only one type of edge and comprises nodes from both the original graph and the set of edge-nodes.
We learn the relation representations by training a GNN model on the extended WorldGraph $\tilde{G}_W$ and by averaging the edge-node embeddings corresponding to each relation type (Step (c) in Figure 2). For the GNN model, we consider the Graph Convolutional Network (GCN) (Kipf and Welling, 2016) and the Graph Attention Network (GAT) architectures. Since the nodes do not have any features or attributes, we randomly initialize the embeddings in these GNN message passing layers.
The intuition behind creating the extended graph is that the representation GNN function can learn the relation embeddings based on the structure of the complete relational graph $G_W$. We expect this to provide an inductive bias that can generalize more effectively than the simple Param approach. Finally, note that while the representation function is given access to the WorldGraph to learn representations for relations, the composition module is not able to interface with the WorldGraph in order to make predictions about a query.
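The construction of the extended graph and the averaging step can be sketched as follows. This is a minimal illustration under stated assumptions. Edges are given as `(u, relation, v)` triples, each labeled edge is replaced by an edge-node linked to its two endpoints, and node embeddings are plain lists of floats that a GNN (not shown) is assumed to have produced. The function names are hypothetical.

```python
from collections import defaultdict


def extend_world_graph(edges):
    """Build the extended graph from (u, relation, v) triples.

    Returns an undirected adjacency map with a single edge type, plus a map
    from each edge-node to the relation of the edge it replaces.
    """
    adjacency = defaultdict(set)
    edge_node_relation = {}
    for index, (u, relation, v) in enumerate(edges):
        edge_node = ("edge", index)
        edge_node_relation[edge_node] = relation
        # The edge-node is connected only to the two endpoints of the original edge.
        for endpoint in (u, v):
            adjacency[edge_node].add(endpoint)
            adjacency[endpoint].add(edge_node)
    return dict(adjacency), edge_node_relation


def relation_embeddings(node_embeddings, edge_node_relation):
    """Average the embeddings of all edge-nodes that share a relation type."""
    grouped = defaultdict(list)
    for edge_node, relation in edge_node_relation.items():
        grouped[relation].append(node_embeddings[edge_node])
    return {
        relation: [sum(column) / len(vectors) for column in zip(*vectors)]
        for relation, vectors in grouped.items()
    }
```

Averaging over edge-nodes means that a relation's representation depends on every place the relation occurs in the world, rather than on a single free embedding per relation.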
4.2 Composition modules
We now describe the GNNs used for the composition modules. These models take as input the query $(g, u, v)$ and the relation embeddings (Steps (d) and (e) in Figure 2).
Relational Graph Convolutional Network (RGCN). Given that the input to the composition module is a relational graph, the RGCN model (Schlichtkrull et al., 2018) is a natural choice for a baseline architecture. In this approach, we iterate a series of message passing operations:

$$h_u^{(t)} = \mathrm{ReLU}\Big( \sum_{r_i \in R} \sum_{v \in \mathcal{N}_{r_i}(u)} \mathbf{r}_i \times_1 \mathcal{T} \times_3 h_v^{(t-1)} \Big)$$

where $h_u^{(t)}$ denotes the representation for a node $u$ at the $t$-th layer of the model, $\mathcal{T}$ is a learnable tensor, $\mathbf{r}_i$ is the representation for relation $r_i$, and $\mathcal{N}_{r_i}(u)$ denotes the neighbors of node $u$ by relation $r_i$. We use $\times_k$ to denote multiplication across the $k$-th mode of the tensor. This RGCN model learns a relation-specific propagation matrix, specified by the interaction between the relation embedding $\mathbf{r}_i$ and the shared tensor $\mathcal{T}$. (Note that the shared tensor is equivalent to the basis matrix formulation in Schlichtkrull et al. (2018).)
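The relation-specific propagation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the evaluated model. It assumes the shared tensor is stored as a list of matrices (one per dimension of the relation embedding), that messages flow from the source to the target of each edge, and that a ReLU is applied after aggregation. All names are hypothetical.

```python
def relation_matrix(rel_embedding, tensor):
    """Contract a relation embedding with the first mode of the shared tensor.

    tensor[b] is a (d_out x d_in) matrix; the result is their weighted sum,
    which plays the role of a relation-specific propagation matrix.
    """
    d_out, d_in = len(tensor[0]), len(tensor[0][0])
    return [
        [sum(rel_embedding[b] * tensor[b][i][j] for b in range(len(rel_embedding)))
         for j in range(d_in)]
        for i in range(d_out)
    ]


def rgcn_layer(h, edges, rel_embeddings, tensor):
    """One message-passing step over (source, relation, target) triples."""
    matrices = {rel: relation_matrix(emb, tensor) for rel, emb in rel_embeddings.items()}
    d_out = len(tensor[0])
    aggregated = {node: [0.0] * d_out for node in h}
    for source, relation, target in edges:
        weight = matrices[relation]
        for i in range(d_out):
            aggregated[target][i] += sum(
                weight[i][j] * h[source][j] for j in range(len(h[source]))
            )
    # Assumed nonlinearity: element-wise ReLU on the aggregated messages.
    return {node: [max(0.0, x) for x in vec] for node, vec in aggregated.items()}
```

Because every relation reuses the same tensor and differs only in its embedding, the number of parameters does not grow with one full matrix per relation.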
Edge-based Graph Attention Network (Edge-GAT). In addition to the RGCN model, which is considered the de facto standard architecture for applying GNNs to multi-relational data, we also explore an extension of the Graph Attention Network (GAT) model (Veličković et al., 2017) to handle edge types. Many recent works have highlighted the importance of the attention mechanism, especially in the context of relational reasoning (Vaswani et al., 2017; Santoro et al., 2018; Schlag et al., 2019). Motivated by this observation, we investigate an extended version of the GAT, where we incorporate gating via an LSTM (Hochreiter and Schmidhuber, 1997) and where the attention is conditioned on both the incoming message (from the other nodes) and the relation embedding (of the other nodes).
Following the original GAT model, the attention function is defined using a dense neural network on the concatenation of the input vectors. We refer to this model as the Edge GAT (E-GAT) model.
Query and node representations. We predict the relation for a given query $(g, u, v)$ by concatenating $h_u^{(L)}$ and $h_v^{(L)}$ (the final-layer query node embeddings, assuming an $L$-layer GNN) and applying a two-layer dense neural network (Step (f) in Figure 2). The entire model (i.e., the representation function and the composition function) is trained end-to-end using the softmax cross-entropy loss. Since we have no node features, we randomly initialize all the node embeddings in the GNNs.
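The prediction step can be sketched without any deep learning library. The sketch below is illustrative and makes assumptions the text does not state: the hidden layer uses a ReLU, weights are given as nested lists, and the function names are hypothetical. It concatenates the two query node embeddings, applies two dense layers, and scores the result with the softmax cross-entropy loss.

```python
import math


def dense(vec, weights, bias):
    """Affine layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def predict_relation(h_u, h_v, w1, b1, w2, b2):
    """Return relation logits from the concatenated query node embeddings."""
    hidden = [max(0.0, x) for x in dense(h_u + h_v, w1, b1)]  # assumed ReLU
    return dense(hidden, w2, b2)


def softmax_cross_entropy(logits, target):
    """Negative log-probability of the target class, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]
```

Subtracting the largest logit before exponentiating keeps the loss finite even when the logits are large.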
We aim to quantify the performance of the different GNN models on the task of logical relation reasoning, in three contexts: (i) Single Task Supervised Learning, (ii) Multi-Task Training and (iii) Continual Learning. Our experiments use the GraphLog benchmark with 57 distinct worlds or knowledge graph datasets (see Section 3) and 6 different GNN models (see Section 4). In the main paper, we share the key trends and observations that hold across the different combinations of the models and the datasets, along with some representative results. The full set of results is provided in the Appendix. All the models are implemented using PyTorch 1.3.1 (Paszke et al., 2019). The code has been included with the supplemental material.
5.1 Single Task Supervised Learning
In our first setup, we train and evaluate all of the models on all the 57 worlds, one model-world pair at a time. This experiment provides several important results. Previous works considered only a handful of datasets when evaluating the different models on the task of relational reasoning. As such, it is possible to design a model that can exploit the biases present in the few datasets that the model is being evaluated over. In our case, we consider over 50 datasets with different characteristics (Table 2). It is difficult for one model to outperform the other models on all the datasets just by exploiting some dataset-specific bias, which makes the conclusions more robust.
In Figure 3, we present the results for the different models. We group the worlds into three categories of difficulty (easy, moderate, and difficult) based on the relative test performance of the models on each world. Table 6 (in the Appendix) contains the results for the different models on the individual worlds. We observe that the models using E-GAT as the composition function always outperform their counterparts using the RGCN model. This confirms our hypothesis about the usefulness of combining relational reasoning and attention for improving the performance on relational reasoning tasks. An interesting observation is that the relative ordering among the worlds, in terms of the test accuracy of the different models, is consistent irrespective of the model we use, highlighting the intrinsic difficulty of the different worlds in GraphLog.
5.2 Multi-Task Training
We now turn to the setting of multi-task learning where we train the same model on multiple logical worlds.
Basic multi-task training. First, we evaluate how changing the similarity among the training worlds affects the test performance in the multi-task setup, where a model is trained jointly on eight worlds and tested on three distinct worlds. In Table 3, we observe that considering a mix of similar and dissimilar worlds improves the generalization capabilities of all the models when evaluated on the test split. Another important observation is that, just like in the supervised learning setup, the GAT-EGAT model consistently performs as well as or better than the other models, and the models using E-GAT for the composition function perform better than the ones using the RGCN model. Figure 4 shows how the performance of the various models changes when we perform multi-task training on an increasingly large set of worlds. Interestingly, we see that model performance improves as the number of worlds is increased from 10 to 20 but then begins to decline, indicating capacity saturation in the presence of too many diverse worlds.
Multi-task pre-training. In this setup, we pre-train the model on multiple worlds and adapt it on a held-out world. We study how the models’ adaptation capabilities vary as the similarity between the training and the evaluation distributions changes. Figure 5 considers the case of zero-shot adaptation and adaptation until convergence. As we move along the x-axis, the zero-shot performance (shown with solid colors) decreases in all the setups. This is expected, as the similarity between the training and the evaluation distributions also decreases. An interesting trend is that the model’s performance after adaptation increases as the similarity between the two distributions decreases. This suggests that training over a diverse set of distributions improves adaptation capability. The results for adaptation with 5, 10, … 30 steps are provided in the Appendix (Figure 8).
5.3 Continual Learning Setup
In the continual learning setup, we evaluate the knowledge retention capabilities of the GNN models. We train the model on a sequence of overlapping worlds, and after converging on every world, we report the average of the model’s performance on all the previous worlds. In Figure 6, we observe that as the model is trained on different worlds, the performance on the previous worlds degrades rapidly. This highlights that the current reasoning models are not suitable for continual learning.
The role of the representation function. We also investigate the model’s performance in a continual learning setup where the model learns only a world-specific representation function or a world-specific composition function, and where the other module is shared across the worlds. In Figure 7, we observe that sharing the representation function reduces the effect of catastrophic forgetting, but sharing the composition function does not have the same effect. This suggests that the representation function learns representations that are useful across the worlds.
6 Discussion & Conclusion
In this work, we propose GraphLog, a benchmark suite for evaluating the logical generalization capabilities of Graph Neural Networks. GraphLog is grounded in first-order logic and provides access to a large number of diverse tasks that require compositional generalization to solve, including single task supervised learning, multi-task learning, and continual learning. Our results highlight the importance of attention mechanisms and modularity to achieve logical generalization, while also highlighting open challenges related to multi-task and continual learning in the context of GNNs. A natural direction for future work is leveraging GraphLog for studies of fast adaptation and meta-learning in the context of logical reasoning (e.g., via gradient-based meta learning), as well as integrating state-of-the-art methods (e.g., regularization techniques) to combat catastrophic forgetting in the context of GNNs.
- Relational thinking and relational reasoning: harnessing the power of patterning. NPJ science of learning 1 (1), pp. 1–7. Cited by: §1.
- Systematic generalization: what is required and can it be learned?. arXiv preprint arXiv:1811.12889. Cited by: Table 1.
- Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pp. 4502–4510. Cited by: §1.
- Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §1.
- Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: Appendix D.
- Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: §2.
- Are powerful graph neural nets necessary? a dissection on graph classification. arXiv preprint arXiv:1905.04579. Cited by: §2.
- Textworld: a learning environment for text-based games. In Workshop on Computer Games, pp. 41–75. Cited by: Table 1.
- Continual learning: a comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383. Cited by: §3.4.1.
- Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §2.
- On the role of scientific thought. In Selected writings on computing: a personal perspective, pp. 60–66. Cited by: §4.
- Graph neural tangent kernel: fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, pp. 5724–5734. Cited by: §1.
- Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §2.
- Learning Explanatory Rules from Noisy Data. Cited by: §3.1.
- The role of relational reasoning in children’s addition concepts. British Journal of Developmental Psychology 25 (2), pp. 227–246. Cited by: §1.
- Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §1, §2, §4.1.
- Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–287. Cited by: §A.4.
- Jointly embedding knowledge graphs and logical rules. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 192–202. Cited by: §2.
- Relational knowledge: the foundation of higher cognition. Trends in cognitive sciences 14 (11), pp. 497–505. Cited by: §1.
- Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin. Cited by: §1.
- Embedding logical queries on knowledge graphs. In Advances in Neural Information Processing Systems 31, pp. 2026–2037. Cited by: §3.3.
- Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §2.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.2.
- YAGO2: exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the 20th international conference companion on World wide web, pp. 229–232. Cited by: §2.
- The oxford handbook of thinking and reasoning. Oxford University Press. Cited by: §1.
- Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: Table 1.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2, §4.1.
- A hierarchy for relational reasoning in the prefrontal cortex. Cortex 47 (5), pp. 588–597. Cited by: §1.
- Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350. Cited by: Table 1.
- Applications of machine learning and rule induction. Communications of the ACM 38 (11), pp. 54–64. Cited by: §3.1.
- Yago3: a knowledge base from multilingual wikipedias. Cited by: §2.
- WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §2.
- Never ending language learning. In Big Data (Big Data), 2014 IEEE International Conference on, pp. 1–1. Cited by: §2.
- Weisfeiler and Leman go neural: higher-order graph neural networks. In AAAI, Cited by: §2.
- Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §3.4.1.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: §5.
- Knowledge graph refinement: a survey of approaches and evaluation methods. Semantic web 8 (3), pp. 489–508. Cited by: §2.
- Young children’s analogical reasoning across cultures: similarities and differences. Journal of Experimental Child Psychology 105 (1-2), pp. 146–153. Cited by: §1.
- Relational recurrent neural networks. In Advances in neural information processing systems, pp. 7299–7310. Cited by: §4.2.
- A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: §1.
- The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §1, §2.
- Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611. Cited by: §4.2.
- Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: §1, §2, §4.2, footnote 1.
- Clutrr: a diagnostic benchmark for inductive reasoning from text. arXiv preprint arXiv:1908.06177. Cited by: Table 1.
- On training recurrent neural networks for lifelong learning. Cited by: §3.4.1.
- Connecting instances to promote children’s relational reasoning. Journal of experimental child psychology 108 (2), pp. 260–277. Cited by: §1.
- Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pp. 697–706. Cited by: §2.
- Learning to learn. Springer Science & Business Media. Cited by: §3.4.1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.2.
- Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2, §4.1, §4.2.
- Knowledge base completion via search-based question answering. In Proceedings of the 23rd international conference on World wide web, pp. 515–526. Cited by: §2.
- How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §2.
- Gnnexplainer: generating explanations for graph neural networks. In Advances in Neural Information Processing Systems, pp. 9240–9251. Cited by: §2.
Appendix A GraphLog
A.1 Extended Terminology
In this section, we extend the terminology introduced in Section 3.1. A set of relations is said to be Invertible if
i.e. for all relations in , there exists a relation in such that for all node pairs in the graph, if there exists an edge then there exists another edge . Invertible relations are useful in determining the inverse of a clause, where the directionality of the clause is flipped along with the ordering of the elements in the conjunctive clause. For example, the inverse of Equation 1 will be of the form:
A.2 Dataset Generation
This section follows up on the discussion in Section 3.3. We describe all the steps involved in the dataset generation process.
Rule Generation. In Algorithm 1, we describe the complete process of generating rules in GraphLog. We require the set of relations, which we use to sample the rule set . We mark some rules as being Invertible Rules (Section A.1). Then, we iterate through all possible combinations of relations in DataLog format to sample possible candidate rules. We impose two constraints on each candidate rule: (i) No two rules in can have the same body. This ensures consistency between the rules. (ii) A candidate rule cannot share relations between its head and its body. This ensures the absence of cycles. We also add the inverse rule of each sampled candidate rule and apply the same consistency checks to it. We employ two types of unary Horn clauses to perform the closure of the available rules and to check the consistency of the different rules in . Using this process, we ensure that all generated rules are sound and consistent with respect to .
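The two constraints above can be illustrated with a minimal sketch. It is not the authors' Algorithm 1: it assumes rules whose body is an ordered pair of relations and whose head is a single relation, and it omits the inverse-rule and closure checks. The function name `generate_rules` and its parameters are hypothetical.

```python
import itertools
import random


def generate_rules(relations, max_rules, seed=0):
    """Sample rules of the assumed form (r1, r2) -> head.

    Enforces the two constraints described in the text:
    (i) no two rules share the same body, and
    (ii) the head relation never appears in the body.
    """
    rng = random.Random(seed)
    # Enumerate every (body, head) combination, then visit them in random order.
    candidates = [
        ((r1, r2), head)
        for r1, r2 in itertools.product(relations, repeat=2)
        for head in relations
    ]
    rng.shuffle(candidates)

    rules = {}  # maps body -> head, so a body can be stored at most once
    for body, head in candidates:
        if len(rules) >= max_rules:
            break
        if body in rules:  # constraint (i): body already used
            continue
        if head in body:  # constraint (ii): head shares a relation with the body
            continue
        rules[body] = head
    return rules


rules = generate_rules(["r1", "r2", "r3", "r4"], max_rules=5)
```

Storing the rules in a dictionary keyed by body makes constraint (i) hold by construction; the explicit check only avoids overwriting an earlier rule.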
World Sampling. From the set of rules in , we partition rules into buckets for different worlds (Algorithm 2). We use a simple policy of bucketing via a sliding window of width , to classify rules pertaining to each world. For example, two such consecutive worlds can be generated as and . We randomly permute before bucketing in order.
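The sliding-window bucketing can be sketched as follows. This is an illustration under assumptions rather than the authors' Algorithm 2: the names `sample_worlds`, `window` and `stride` are hypothetical, and the toy values below are not the ones used to build GraphLog.

```python
import random


def sample_worlds(rules, window, stride, seed=0):
    """Permute the rules, then slide a fixed-width window over them.

    Each window position yields the rule set of one world; consecutive
    worlds share (window - stride) rules when stride < window.
    """
    rng = random.Random(seed)
    rules = list(rules)
    rng.shuffle(rules)  # random permutation before in-order bucketing
    worlds = []
    for start in range(0, len(rules) - window + 1, stride):
        worlds.append(rules[start:start + window])
    return worlds


toy_rules = [f"rule_{i}" for i in range(10)]
worlds = sample_worlds(toy_rules, window=4, stride=2)
shared = set(worlds[0]) & set(worlds[1])
```

With 10 rules, a window of 4 and a stride of 2, this yields 4 worlds, and each pair of consecutive worlds shares 2 rules. Smaller strides give more worlds with higher overlap.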
Graph Generation. This is a two-step process where first we sample a world graph (Algorithm 3) and then we sample individual graphs from the world graph (Algorithm 4). Given a set of rules , in the first step, we recursively sample and apply rules in to generate a relation graph called world graph. This sampling procedure enables us to create a diverse set of world graphs by considering only certain subsets (of ) during sampling. By controlling the extent of overlap between the subsets of (in terms of the number of rules that are common across the subsets), we can precisely control the similarity between the different world graphs.
In the second step (Algorithm 4), the world graph is used to sample a set of graphs . A graph is sampled from by sampling a pair of nodes from and then by sampling a resolution path . The edge provides the target relation that the learning model has to predict. Since the relation for the edge can be resolved by composing the relations along the resolution path, the relation prediction task tests the compositional generalization abilities of the models. We first sample all possible resolution paths and get their individual descriptors , which we split into training, validation and test splits. We then construct the training, validation and testing graphs by first adding all edges of an individual to the corresponding graph , and then sampling neighbors of . Concretely, we use Breadth First Search (BFS) to sample the neighboring subgraph of each node with a decaying selection probability. This allows us to create diverse input graphs while retaining precise control over each graph's resolution path through its descriptor . Splitting the dataset over these descriptor paths ensures inductive generalization.
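The neighborhood sampling step can be sketched as a BFS in which the probability of keeping an edge shrinks with depth. The text does not state the decay schedule, so the geometric schedule `p0 * decay ** depth`, the adjacency format and the function name `sample_neighborhood` are all assumptions made for illustration.

```python
import random
from collections import deque


def sample_neighborhood(adj, start, p0=1.0, decay=0.5, seed=0):
    """BFS from `start`, keeping an edge found at depth d with
    probability p0 * decay**d (an assumed geometric schedule).

    `adj` maps a node to a list of (relation, neighbor) pairs.
    Returns the kept edges as (source, relation, target) triples.
    """
    rng = random.Random(seed)
    visited = {start}
    kept_edges = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        keep_prob = p0 * decay ** depth
        for relation, neighbor in adj.get(node, []):
            if neighbor in visited:
                continue
            if rng.random() < keep_prob:
                visited.add(neighbor)
                kept_edges.append((node, relation, neighbor))
                queue.append((neighbor, depth + 1))
    return kept_edges


toy_adj = {"a": [("r1", "b"), ("r2", "c")], "b": [("r3", "d")]}
all_edges = sample_neighborhood(toy_adj, "a", p0=1.0, decay=1.0)
one_hop = sample_neighborhood(toy_adj, "a", p0=1.0, decay=0.0)
```

With `decay=1.0` every reachable edge is kept, and with `decay=0.0` only the edges adjacent to the start node survive. Intermediate values produce neighborhoods that thin out with distance from the resolution path.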
A.3 Computing Similarity
GraphLog provides precise control for categorizing the similarity between different worlds by computing the overlap of the underlying rules. Concretely, the similarity between two worlds and is defined as , where and are the graph worlds and and are the sets of rules associated with them. Thus GraphLog enables various training scenarios, such as training on highly similar worlds or training on a mix of similar and dissimilar worlds. This fine-grained control allows GraphLog to mimic both in-distribution and out-of-distribution scenarios during training and testing. It also enables us to precisely characterize the effect of multi-task pre-training when the model needs to adapt to novel worlds.
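As an illustration of rule-overlap similarity, the sketch below computes the number of shared rules and a normalized variant. The exact formula is not reproduced in the text above, so both functions are assumptions: `rule_overlap` counts common rules, and `jaccard_similarity` is one possible normalization, chosen here only because it maps complete overlap to 1.0 and zero overlap to 0.0.

```python
def rule_overlap(rules_a, rules_b):
    """Number of rules shared by two worlds."""
    return len(set(rules_a) & set(rules_b))


def jaccard_similarity(rules_a, rules_b):
    """Shared rules divided by the union of rules (assumed normalization)."""
    a, b = set(rules_a), set(rules_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


world_1 = {"r1", "r2", "r3"}
world_2 = {"r2", "r3", "r4"}
overlap = rule_overlap(world_1, world_2)
similarity = jaccard_similarity(world_1, world_2)
```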
A.4 Computing difficulty
Recent research in multitask learning has shown evidence that models prioritize the selection of difficult tasks over easy tasks while learning to boost the overall performance (Guo et al., 2018). Thus, GraphLog also provides a method to examine how pretraining on tasks of different difficulty levels affects the adaptation performance. Due to the stochastic partitioning of the rules, GraphLog consists of datasets with a varying range of difficulty. We use the supervised learning scores (Table 6) as a proxy to determine the relative difficulty of different datasets. We cluster the datasets such that tasks with prediction accuracy of 70% or above are labeled as easy, tasks with accuracy of 50-70% are labeled as medium, and tasks with accuracy below 50% are labeled as hard. We find that the labels obtained by this criterion are consistent across the different models (Figure 3).
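The bucketing rule can be written as a small function. The thresholds come from the text; the function name and the choice to express accuracy as a fraction in [0, 1] are illustrative.

```python
def difficulty_label(accuracy):
    """Map a supervised accuracy in [0, 1] to a difficulty bucket."""
    if accuracy >= 0.70:
        return "easy"
    if accuracy >= 0.50:
        return "medium"
    return "hard"


labels = [difficulty_label(a) for a in (0.82, 0.70, 0.61, 0.50, 0.49)]
```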
Appendix B Supervised learning on GraphLog
We perform extensive experiments over all the datasets available in GraphLog (statistics given in Table 6). We observe that, in general, the GAT-E-GAT model performs the best over the entire set of 57 worlds. We observe that the relative difficulty (Section A.4) of the tasks is highly correlated with the number of descriptors (Section A.1) available for each task. This shows that, for a learner, a dataset with enough variety among the resolution paths of the graphs is easier to learn than datasets with less variation.
Appendix C Multitask Learning
C.1 Multitask Learning on different data splits by difficulty
In Section A.4 we introduced the notion of difficulty among the tasks available in GraphLog. Here, we consider a set of experiments where we perform multitask training and inductive testing on the worlds bucketed by their relative difficulty (Table 4). We sample an equal number of worlds from each difficulty bucket, and separately perform multitask training and testing. We evaluate the average prediction accuracy on the datasets within each bucket. We observe that the average multitask performance also mimics the relative task difficulty distribution. We find that the GAT-E-GAT model outperforms the other baselines in the Easy and Medium setups, but is outperformed by the GAT-RGCN model in the Difficult setup. For each model, we used the same architecture and hyperparameter settings across the buckets. Optimizing individually for each bucket may improve the relative performance.
C.2 Multitask Pre-training by task similarity
In the main paper (Section 5.2) we introduce the setup of performing multitask pre-training on GraphLog datasets and adaptation on the datasets based on relative similarity. Here, we perform a fine-grained analysis of the few-shot adaptation capabilities of the models. We analyze the adaptation performance in two settings: when the adaptation dataset has complete overlap of rules with the training datasets (group=1.0) and when the adaptation dataset has zero overlap with the training datasets (group=0.0). We find that the RGCN family of models with a graph-based representation function adapts faster on the dissimilar dataset, with GCN-RGCN showing the fastest improvement. However, on the similar dataset the models follow the ranking of the supervised learning experiments, with the GAT-E-GAT model adapting comparatively better.
C.3 Multitask Pre-training by task difficulty
Using the notion of difficulty introduced in Section A.4, we perform a suite of experiments to evaluate the effect of pre-training on Easy, Medium and Difficult datasets. Interestingly, we find that performance at convergence is better after pre-training on the Medium and Hard datasets than after pre-training on the Easy dataset (Table 5). This behaviour is also mirrored in the k-shot adaptation performance (Figure 9), where pre-training on the Hard dataset provides faster adaptation for 4 of the 6 models.
Appendix D Continual Learning
A natural question arises following our continual learning experiments in Section 5.3: does the order of difficulty of the worlds matter? To answer it, we perform an experiment following the Curriculum Learning (Bengio et al., 2009) setup, where the worlds are ordered from easier to harder. The relative difficulty of each world is determined by the performance of the models in the supervised learning setup (Table 6). We observe that while the current task accuracy follows the trend of the difficulty of the worlds (Figure 10), the mean of past accuracy is significantly worse. This suggests that a curriculum learning strategy might not be optimal for learning graph representations in a continual learning setting. We also performed the same experiment while sharing only the composition function or only the representation function (Figure 11), and observe similar trends, where sharing the representation function reduces the effect of catastrophic forgetting.
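The curriculum ordering amounts to sorting worlds by their supervised accuracy, highest first. The sketch below assumes a dictionary from world name to accuracy; the world names and scores are made up for illustration and are not taken from Table 6.

```python
def curriculum_order(world_accuracy):
    """Order worlds from easiest (highest supervised accuracy) to hardest."""
    return sorted(world_accuracy, key=world_accuracy.get, reverse=True)


toy_scores = {"world_a": 0.45, "world_b": 0.82, "world_c": 0.61}
order = curriculum_order(toy_scores)
```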
|AGG||16.33||428.94||4.70||14.89||18.37||0.618 / 26||0.603 / 10||0.611 / 20||0.530 / 1||0.526 / 0||0.539 / 0|
D: Difficulty, AGG: Aggregate Statistics. List of models considered: M1: GAT-E-GAT, M2: GCN-E-GAT, M3: Param-E-GAT, M4: GAT-RGCN, M5: GCN-RGCN and M6: Param-RGCN. Difficulty is calculated by taking the scores of the model (M1) and partitioning the worlds according to their accuracy (70% or above = Easy, 50-70% = Medium, and below 50% = Hard; see Section A.4). We provide both the mean of the raw accuracy scores for all models and the number of times each model is ranked first across all the tasks.
Appendix E Hyperparameters and Experimental Setup
In this section, we provide detailed hyperparameter settings for both models and dataset generation for the purposes of reproducibility. The codebase and dataset used in the experiments are attached in the Supplementary materials, and will be made public on acceptance.
E.1 Dataset Hyperparams
We generate GraphLog with 20 relations or classes (), which results in 76 rules in after consistency checks. For unary rules, we specify half of the relations to be symmetric and the other half to have inverse relations. To split the rules for individual worlds, we choose the number of rules for each world and stride , and end up with 57 worlds . For each world , we generate 5000 training, 1000 testing and 1000 validation graphs.
E.2 Model Hyperparams
For all models, we perform a hyper-parameter sweep (grid search) to find the optimal values based on the validation accuracy. For all models, we use 200-dimensional relation and node embeddings. We train all models with the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.0001. For the supervised setting, we train all models for 500 epochs, and we add a learning rate scheduler that decays the learning rate by a factor of 0.8 whenever the validation loss is stagnant for 10 epochs. In the multitask setting, we sample a new task every epoch from the list of available tasks. Here, we run all models for 2000 epochs when we have the number of tasks. For a larger number of tasks (Figure 4), we increase the number of epochs in proportion to the number of tasks (2k epochs for 10 tasks, 4k epochs for 20 tasks, 6k epochs for 30 tasks, 8k epochs for 40 tasks and 10k epochs for 50 tasks). For the continual learning experiment, we train each task for 100 epochs for all models. No learning rate scheduling is used for either the multitask or the continual learning experiments. Individual model hyper-parameters are as follows:
Representation functions:
GAT : Number of layers = 2, Number of attention heads = 2, Dropout = 0.4
GCN : Number of layers = 2, with symmetric normalization and bias, no dropout
E-GAT: Number of layers = 6, Number of attention heads = 2, Dropout = 0.4
RGCN: Number of layers = 2, no dropout, with bias.
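The training schedule described in this section can be sketched in plain Python. This is not the authors' training code: `PlateauDecay` is a simplified, hypothetical stand-in for a reduce-on-plateau scheduler (library implementations may count stagnant epochs slightly differently), and `multitask_epochs` only reproduces the listed epoch counts for 10 to 50 tasks.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience`
    consecutive epochs without a new best validation loss."""

    def __init__(self, lr=0.001, factor=0.8, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale = 0  # epochs since the last improvement

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


def multitask_epochs(num_tasks, epochs_per_task=200):
    """Epoch budget that grows in proportion to the number of tasks."""
    return epochs_per_task * num_tasks


scheduler = PlateauDecay()
scheduler.step(1.0)  # first epoch sets the best validation loss
for _ in range(9):
    lr_before_decay = scheduler.step(1.0)  # 9 stagnant epochs: no decay yet
lr_after_decay = scheduler.step(1.0)  # 10th stagnant epoch triggers the decay
budgets = [multitask_epochs(n) for n in (10, 20, 30, 40, 50)]
```

In the supervised setting the scheduler would be stepped once per epoch with the validation loss; the multitask and continual learning runs described above use a fixed learning rate instead.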