
ViRel: Unsupervised Visual Relations Discovery with Graph-level Analogy

07/04/2022
by Daniel Zeng, et al.

Visual relations form the basis of understanding our compositional world, as relationships between visual objects capture key information in a scene. It is then advantageous to learn relations automatically from the data, as learning with predefined labels cannot capture all possible relations. However, current relation learning methods typically require supervision, and are not designed to generalize to scenes with more complicated relational structures than those seen during training. Here, we introduce ViRel, a method for unsupervised discovery and learning of Visual Relations with graph-level analogy. In a setting where scenes within a task share the same underlying relational subgraph structure, our learning method of contrasting isomorphic and non-isomorphic graphs discovers the relations across tasks in an unsupervised manner. Once the relations are learned, ViRel can then retrieve the shared relational graph structure for each task by parsing the predicted relational structure. Using a dataset based on grid-world and the Abstraction and Reasoning Corpus, we show that our method achieves above 95% accuracy in relation classification, discovers the relation graph structure for most tasks, and further generalizes to unseen tasks with more complicated relational structures.



1 Introduction

Our world is naturally compositional: concepts, whether abstract or physical, are hierarchically composed of constituent concepts and their relations. In parallel, human intelligence has evolved to understand and reason with compositional structure. In the visual domain, this ability enables humans to quickly understand visual scenes and generalize to previously unseen, complex scenes. A key factor in compositionality is understanding visual relations, for example, that one object has the same shape as another.

A number of works have explored visual relations, for scene graph generation or image reconstruction [6; 8; 14], localization and grounding of subject-predicate-object triplets [15], dynamics prediction [4; 12], visual relation detection [17], few-shot classification [19], and compound relation classification [18]. These works have focused on applying visual relations to downstream higher-level tasks with known relation labels, or learning relations in a supervised setting.

In this work, we tackle a novel problem in visual relations: learning and discovering relations in the unsupervised setting, where relation types and labels are not known a priori. In the unsupervised setting, the relevant relations are learned from the data, removing the dependence on predefined relation types. This unlocks greater generalization capacity for learning, reasoning, and compositionality. In contrast, learning relations with predefined labels in a supervised setting limits us to the relations seen during training.

Our key insight is that the emergence of relations comes from graph-level analogy, where scenes within a task share a common relational subgraph consisting of concepts as nodes and relations as edges. We introduce a graph-isomorphism-based objective to learn a distinct graph embedding (computed from relation embeddings with a GNN) for each task, which in turn encourages distinct clusters of relation embeddings to form. Once the relations are learned, ViRel retrieves the shared relational graph structure for each task by parsing the predicted relational structures.

While the types and appearances of concepts within a task may vary, ViRel is able to infer the global relation types, achieve above 95% accuracy in relation classification in all 6 of our dataset configurations, retrieve the common relational graph structure for seen tasks, and further generalize to unseen, more complex tasks.

2 Task and Dataset

Figure 1: Two example BabyARC tasks and their respective graphs. Each task contains images that share a common relational subgraph. The inputs are the image examples shown under “Example Tasks”. Both the relation types and the graph of each task are unknown. The goal is to infer the relation graph of each observation, the corresponding relations, and the shared relational graph of each task. More example inputs are shown in Fig. 7.

Our setup is inspired by the Abstraction and Reasoning Corpus (ARC) dataset [5], whose tasks aim to serve as a benchmark for human-level intelligence and reasoning. One of the inductive priors in many ARC tasks is that the training examples are graph-isomorphic: there exists a common relational structure across the examples. A motivating ARC task is shown in Fig. 8, with more in Fig. 9 and 10. This motivates our use of graph-level analogy, exploiting the shared relational structure within tasks to discover the relations between objects.

However, due to the highly challenging, low-data nature of ARC, we evaluate ViRel on BabyARC [20], a dataset generator that captures the graph-isomorphic essence of ARC tasks. Moreover, BabyARC provides the underlying metadata of each generated example, allowing evaluation of our method, e.g., the accuracy of the predicted relations.

Datasets for visual reasoning such as CLEVR [9], SHAPES [2], or Visual Question Answering [3; 7; 13; 16] exist, but their setups differ substantially from ours: they lack configuration for specifying a shared graph structure, rely on language queries, introduce their own task-specific complexities, or emphasize different learning objectives.

2.1 BabyARC Setup

In the following, we introduce definitions for BabyARC:
Definition 1. Observation: a single image, which contains a collection of concepts, where the number of concepts is known.
Definition 2. Concept: an object in the given observation. Concepts may be rectangles, lines, etc.
Definition 3. Relation: the relationship between two objects, specifically a visual relation in the BabyARC setting. For example, if two objects share the same color, the relation between them is referred to as “same-color”. Similarly, “same-shape” represents that two objects have the same shape, and “inside” represents that one object is inside another object. If two objects do not have any relation, the relation label is “none”.

Our task definition on BabyARC is the following: we are given a collection of observations (images), where each observation belongs to some known task. Each task has an unknown, unique task graph whose nodes are objects and whose edges are relations between objects; each edge specifies the relation type between the corresponding pair of objects. All of a task's observations share this common relational subgraph.

In addition, only the observations of each task are provided to the model, without knowledge of the objects, the global relation types, or the underlying graph of each observation. The goal is to infer the global relation types, the graph of each observation, and the relational graph structure belonging to each task. Accurately inferring the underlying graph requires the model to identify the correct relation type between every object pair.

Two example BabyARC tasks are shown in Fig. 1. In these examples, the graph of each observation is isomorphic to the task graph. In general, however, only a subgraph of an observation's graph may be isomorphic to the task graph. We define objects that are part of the shared relational subgraph as “core” objects, and objects that are not as “distractor” objects. Distractor objects can be viewed as random objects added to the observation that are irrelevant to the task graph. The distinction between distractor and core objects is not given to the model. Examples of tasks with distractor objects are shown in Fig. 5.

3 Method

3.1 Concept-Relation Graph Neural Network (CR-GNN) Architecture

Figure 2: CR-GNN architecture, consisting of a CNN object encoder, an MLP relation encoder, and a GIN.

We propose the Concept-Relation Graph Neural Network (CR-GNN) architecture, which takes advantage of the subgraph-isomorphic property of each task. Our CR-GNN architecture consists of a CNN object encoder, which encodes each object of an image into an object embedding, and an MLP relation encoder, which encodes each pair of object embeddings into a relation embedding.

Encoding the input image through the object and relation encoders, the image is represented as a latent graph whose node features are the relation embeddings (note that here we use the line-graph representation), and two nodes are connected if the corresponding relations share an object. A 2-layer graph isomorphism network (GIN) [21] is then applied to encode the learned line-graph. An illustration of the CR-GNN architecture is shown in Fig. 2.

The model design choices are guided by the goal of the task objective. The CNN object encoder embeds each visual object into an embedding vector, and the MLP relation encoder maps each pairwise concatenation of object embeddings into a relation embedding vector. We output an explicit relation embedding, as this forces the GNN encoder to depend only on the relational properties of the input, rather than on other properties of the objects themselves. We apply a GIN encoder on top of the line-graph constructed from the relation latent vectors to take advantage of the fact that all observations of the same task are subgraph-isomorphic.
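To make the architecture concrete, the following is a minimal, illustrative sketch of a CR-GNN-style forward pass in PyTorch. It assumes each object is provided as a separate image crop (the paper instead provides ground-truth object masks over the full grid, Appendix A.2); the module names, hidden widths, and input channel count are our own placeholders, and only the 100-dimensional object and 20-dimensional relation latents follow Appendix A.2. It is a sketch of the idea rather than the exact implementation.

import itertools
import torch
import torch.nn as nn


class ObjectEncoder(nn.Module):
    """CNN mapping one object crop (C x H x W) to a 100-dim object embedding."""
    def __init__(self, in_ch=3, dim=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):                          # x: (num_objects, C, H, W)
        return self.fc(self.conv(x).flatten(1))    # (num_objects, dim)


class RelationEncoder(nn.Module):
    """MLP mapping a concatenated pair of object embeddings to a 20-dim relation embedding."""
    def __init__(self, obj_dim=100, rel_dim=20):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * obj_dim, 128), nn.LeakyReLU(),
            nn.Linear(128, 128), nn.LeakyReLU(),
            nn.Linear(128, rel_dim),
        )

    def forward(self, pair_feats):                 # pair_feats: (num_pairs, 2 * obj_dim)
        return self.mlp(pair_feats)


class GINLayer(nn.Module):
    """One GIN update on the line-graph: h' = MLP((1 + eps) * h + A @ h)."""
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(), nn.Linear(dim, dim))

    def forward(self, h, adj):
        return self.mlp((1 + self.eps) * h + adj @ h)


def line_graph_adjacency(num_objects):
    """Line-graph nodes are object pairs (i, j); two nodes are adjacent iff the pairs share an object."""
    pairs = list(itertools.combinations(range(num_objects), 2))
    adj = torch.zeros(len(pairs), len(pairs))
    for a, pa in enumerate(pairs):
        for b, pb in enumerate(pairs):
            if a != b and set(pa) & set(pb):
                adj[a, b] = 1.0
    return pairs, adj


def crgnn_forward(obj_images, obj_enc, rel_enc, gin_layers):
    """One observation (a set of object crops) -> relation embeddings and a graph embedding."""
    z_obj = obj_enc(obj_images)                                        # (N, obj_dim)
    pairs, adj = line_graph_adjacency(z_obj.shape[0])
    pair_feats = torch.stack([torch.cat([z_obj[i], z_obj[j]]) for i, j in pairs])
    z_rel = rel_enc(pair_feats)                                        # (num_pairs, rel_dim)
    h = z_rel
    for layer in gin_layers:                                           # 2-layer GIN over the line-graph
        h = layer(h, adj)
    z_graph = h.sum(dim=0)                                             # sum-pool to a graph embedding
    return z_rel, z_graph


# Example: one observation with 3 objects, each given as a 3-channel 32x32 crop.
obj_enc, rel_enc = ObjectEncoder(), RelationEncoder()
gin = nn.ModuleList([GINLayer(20), GINLayer(20)])
z_rel, z_graph = crgnn_forward(torch.randn(3, 3, 32, 32), obj_enc, rel_enc, gin)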

3.2 Loss Function Objective

We introduce and explore two main loss objectives to learn the CR-GNN model. The first is a contrastive objective that minimizes the distance between the graph representations of examples from the same task (intra-task), while maximizing the distance to the graph representations of examples from other tasks (inter-task). The mathematical formulation of the contrastive objective is:

\mathcal{L}_{\text{contrast}} = \sum_{t}\sum_{i,j \in \mathcal{T}_t,\, i \neq j} \lVert z^{G}_{i} - z^{G}_{j} \rVert_2^{2} \;+\; \sum_{t \neq t'}\sum_{i \in \mathcal{T}_t,\, j \in \mathcal{T}_{t'}} \max\!\left(0,\ \gamma - \lVert z^{G}_{i} - z^{G}_{j} \rVert_2\right)^{2} \tag{1}

where the first summation defines the intra-task loss and the second summation defines the inter-task loss; z^G_i denotes the graph representation of observation i (the output of the GIN), T_t denotes the set of observations belonging to task t, and γ is a margin hyperparameter. Given that the examples of a task share a common subgraph, their graph representations should be similar within the task (intra-task loss) and different across tasks (inter-task loss).
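A minimal sketch of this objective over a mini-batch of graph embeddings, assuming Euclidean distances and a squared hinge on the inter-task term (our choice of form, not necessarily the authors' exact one); it also assumes the batch contains at least two tasks and at least two examples per task:

import torch

def contrastive_loss(z_graph, task_ids, gamma=1.0):
    """z_graph: (B, d) graph embeddings; task_ids: (B,) integer task labels."""
    dist = torch.cdist(z_graph, z_graph)                           # pairwise Euclidean distances
    same = task_ids.unsqueeze(0) == task_ids.unsqueeze(1)          # (B, B) same-task mask
    off_diag = ~torch.eye(len(task_ids), dtype=torch.bool)
    intra = dist[same & off_diag].pow(2).mean()                    # pull same-task graphs together
    inter = torch.clamp(gamma - dist[~same], min=0).pow(2).mean()  # push different-task graphs apart
    return intra + inter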

The second objective is a classification (cross-entropy) based objective: each example is classified as one of the known task graphs, using the true task label as ground truth. The mathematical formulation of the classification objective is:

\mathcal{L}_{\text{classify}} = \sum_{i} \mathrm{CE}\!\left(W z^{G}_{i},\; y_{i}\right) \tag{2}

where CE is the standard cross-entropy loss between the predicted task logits and the true task ID y_i. The prediction logits are obtained via a linear layer W applied to the graph representation learned by the GIN; this linear layer is trained alongside the other model components.
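For concreteness, a minimal sketch of this objective (the function, variable, and head names are illustrative):

import torch.nn.functional as F

def classification_loss(z_graph, task_ids, head):
    """z_graph: (B, d) graph embeddings from the GIN; task_ids: (B,) true task IDs;
    head: a torch.nn.Linear(d, num_tasks) trained alongside the rest of the model."""
    return F.cross_entropy(head(z_graph), task_ids)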

An additional loss term, the information bottleneck (IB) loss, constrains the information between the observation X and its relation embeddings Z_R, which forces the relation latent space toward more clustered embeddings. This can be seen as a regularization on Z_R to capture only the most relevant information about X, which in our case is the relation types. The mathematical formulation of the information bottleneck (IB) loss is:

\mathcal{L}_{\text{IB}} = I\!\left(X;\, Z_{R}\right) \tag{3}

where I(·;·) is the mutual information function (Eq. 4) and Z_R is the random variable formed by concatenating the relation embeddings of an observation X. This implementation follows Alemi et al. [1], who use a variational upper bound [10] for tractable computation.
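A minimal sketch of such a variational bound in practice, assuming (as in deep VIB [1]) that the relation encoder outputs a diagonal Gaussian over relation embeddings and that the prior is a standard normal; the function names are illustrative, not the paper's exact implementation:

import torch

def ib_upper_bound(mu, logvar):
    """KL(q(z|x) || N(0, I)), summed over relation dimensions and averaged over
    relations: a tractable variational upper bound on I(X; Z_R)."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()

def sample_relation(mu, logvar):
    """Reparameterized sample used as the relation embedding downstream."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()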

4 Experiments

4.1 Setup

In the following experiments, the global relation types for the BabyARC dataset are defined to be “none”, “same-shape”, “same-color”, and “inside”. Visual examples of these relations are shown in Fig. 3. The object shapes are a solid rectangle (“rect-solid”), a hollow rectangle (“rect”), an L-shape (“Lshape”), and a line (“line”); see Fig. 4 for examples.

Our dataset specifications also allow us to generate datasets with varying numbers of core objects and distractor objects. We investigate two categories of datasets: tasks containing 2-3 core objects, and tasks containing 2-4 core objects. We also vary the number of distractor objects, with three configurations: no distractor objects, 1 distractor object, and 0-2 distractor objects. The relation specifications for each task are described in Appendix A.4.

Examples of two tasks from the (2-3 core, 0 distractor) generated dataset are shown in Fig. 1. In the first image of Task 0, the brown rectangle has the relation “same-color” with the brown “Lshape”, and the brown “Lshape” has the relation “same-shape” with the yellow “Lshape”.

We train our CR-GNN model with the two loss objectives defined above, on datasets with 2-3 or 2-4 core objects and a varying number of distractor objects depending on the configuration. We also observe the effect of adding the IB loss. Our training and model hyperparameters are described in Appendix A.2.

4.2 Results

                   # Distractors
Method             0      1      0-2
Classify           0.923  0.926  0.946
Classify + IB      0.919  0.918  0.901
Contrastive        0.959  0.961  0.954
Contrastive + IB   0.952  0.963  0.957
Best               0.959  0.963  0.957
Table 1: Relation classification accuracy for 2-3 core objects, by number of distractor objects.

The model is evaluated on two main aspects: the accuracy of the model in predicting the correct relation type between two objects, and its ability to infer the relational graph structure belonging to each task. These evaluations are based on the dataset task goals defined in Section 2.

We evaluate the relation prediction accuracy as follows: we apply k-means clustering to assign a cluster label to each learned relation embedding. We then search over global assignments of cluster labels to ground-truth relation labels and take the maximum accuracy with respect to the ground truth. We only compute the relation accuracy between objects that have a relation label, as we do not know the underlying relation of object pairs without labels. Model accuracy is evaluated on a validation dataset generated with the same parameters as the training dataset, but with a different random seed.
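A hedged sketch of this evaluation protocol, assuming the relation embeddings and ground-truth labels are available as arrays and that there are 4 global relation types; the function and variable names are ours:

from itertools import permutations

import numpy as np
from sklearn.cluster import KMeans

def relation_accuracy(z_rel, labels, n_relations=4):
    """z_rel: (N, d) learned relation embeddings; labels: (N,) ground-truth relation IDs."""
    cluster_ids = KMeans(n_clusters=n_relations, n_init=10).fit_predict(z_rel)
    best = 0.0
    for perm in permutations(range(n_relations)):        # one global cluster-to-label assignment
        mapped = np.array(perm)[cluster_ids]
        best = max(best, float((mapped == labels).mean()))
    return best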

Tables 1 and 2 show the relation classification accuracy for our configurations. Performance is similar between the two objectives, with the contrastive objective performing slightly better. Varying the number of introduced distractor objects does not degrade accuracy in either case.

                   # Distractors
Method             0      1      0-2
Classify           0.956  0.955  0.965
Classify + IB      0.960  0.962  0.959
Contrastive        0.965  0.971  0.965
Contrastive + IB   0.960  0.973  0.971
Best               0.965  0.973  0.971
Table 2: Relation classification accuracy for 2-4 core objects, by number of distractor objects.

Comparing the 2-3 core objects dataset against the 2-4 core objects dataset, there is an overall slight improvement when training on the 2-4 core objects dataset. This is because the 2-4 core objects dataset includes a greater number of tasks (13 vs. 6), which allows the model to learn better relation representations in order to distinguish between different graphs.

Our model is able to infer the global relation types, as shown in Fig. 6, a t-SNE visualization of the learned relation embeddings. The model clusters relation embeddings of the same relation label close to each other, even when not given the ground truth relation labels.

To infer the relational graph structure belonging to each task, the k-means procedure described above is used to predict a cluster label for each learned relation embedding. The graph for each observation is then constructed from these predicted labels. Since the number of objects in the observations of a task may differ due to distractors, we take the maximum common subgraph (MCS) of all the constructed graphs of the same task, which identifies the shared relational graph belonging to that task.

Our MCS retrievals for the 2-4 core object, 0 distractor dataset with the contrastive + IB objective are shown in Appendix A.5, along with details of the evaluation method. ViRel is trained and evaluated on the same 2-4 core object, 0 distractor dataset. The top 3 most frequently retrieved maximum common subgraphs for each task are shown, and the ground truth relational graph occurs in the top 3 retrievals for most tasks. This demonstrates that ViRel is able to uncover the ground truth shared relational graph structure for most tasks.

Furthermore, we demonstrate the ability of our model to generalize to tasks unseen during training. We train ViRel on the 2-3 core object, 1 distractor dataset with the contrastive + IB objective, and at inference time evaluate MCS retrievals on the 2-4 core object, 0 distractor dataset. The relational structures of Tasks 6 to 12 are not seen during training, yet the correct MCS appears in the top 3 retrievals for most of these tasks. The full retrievals are shown in Appendix A.6.

5 Conclusion

In this work, we propose ViRel for unsupervised learning and discovery of visual relations with graph-level analogy. We show that ViRel is able to infer the global relation types, achieve above 95% accuracy in relation classification, and retrieve the shared relational graph structure for most tasks.

One limitation of ViRel is that it only learns the relation representations necessary to distinguish between the given tasks, and thus does not distinguish between the “none” relation (object pairs without a labeled relation) and the “same-color” relation. In the MCS retrievals, extraneous “same-color” predictions appear for this reason. Future work includes understanding how to separate these two representations; a promising approach is to introduce more diverse tasks.

6 Acknowledgement

We thank Rok Sosič for discussions and for providing feedback on our manuscript. We also gratefully acknowledge the support of DARPA under Nos. HR00112190039 (TAMI), N660011924033 (MCS); ARO under Nos. W911NF-16-1-0342 (MURI), W911NF-16-1-0171 (DURIP); NSF under Nos. OAC-1835598 (CINES), OAC-1934578 (HDR), CCF-1918940 (Expeditions), NIH under No. 3U54HG010426-04S1 (HuBMAP), Stanford Data Science Initiative, Wu Tsai Neurosciences Institute, Amazon, Docomo, GSK, Hitachi, Intel, JPMorgan Chase, Juniper Networks, KDDI, NEC, and Toshiba.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding entities.

References

  • [1] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2016) Deep variational information bottleneck. CoRR abs/1612.00410. External Links: Link, 1612.00410 Cited by: §3.2.
  • [2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2015) Deep compositional question answering with neural module networks. CoRR abs/1511.02799. External Links: Link, 1511.02799 Cited by: §2.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. CoRR abs/1505.00468. External Links: Link, 1505.00468 Cited by: §2.
  • [4] P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu (2016) Interaction networks for learning about objects, relations and physics. CoRR abs/1612.00222. External Links: Link, 1612.00222 Cited by: §1.
  • [5] F. Chollet (2019) On the measure of intelligence. CoRR abs/1911.01547. External Links: Link, 1911.01547 Cited by: §2.
  • [6] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling (2019) Scene graph generation with external knowledge and image reconstruction. CoRR abs/1904.00560. External Links: Link, 1904.00560 Cited by: §1.
  • [7] D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for compositional question answering over real-world images. CoRR abs/1902.09506. External Links: Link, 1902.09506 Cited by: §2.
  • [8] J. Johnson, A. Gupta, and L. Fei-Fei (2018) Image generation from scene graphs. CoRR abs/1804.01622. External Links: Link, 1804.01622 Cited by: §1.
  • [9] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick (2016) CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. CoRR abs/1612.06890. External Links: Link, 1612.06890 Cited by: §2.
  • [10] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv. External Links: Document, Link Cited by: §3.2.
  • [11] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §A.2.
  • [12] T. Kipf, E. van der Pol, and M. Welling (2019) Contrastive learning of structured world models. External Links: Document, Link Cited by: §1.
  • [13] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. CoRR abs/1602.07332. External Links: Link, 1602.07332 Cited by: §2.
  • [14] Y. Li, T. Ma, Y. Bai, N. Duan, S. Wei, and X. Wang (2019) PasteGAN: A semi-parametric method to generate image from scene graph. CoRR abs/1905.01608. External Links: Link, 1905.01608 Cited by: §1.
  • [15] Y. Li, W. Ouyang, and X. Wang (2017) ViP-CNN: A visual phrase reasoning convolutional neural network for visual relationship detection. CoRR abs/1702.07191. External Links: Link, 1702.07191 Cited by: §1.
  • [16] F. Liu, G. Emerson, and N. Collier (2022) Visual spatial reasoning. arXiv. External Links: Document, Link Cited by: §2.
  • [17] C. Lu, R. Krishna, M. S. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. CoRR abs/1608.00187. External Links: Link, 1608.00187 Cited by: §1.
  • [18] M. Shanahan, K. Nikiforou, A. Creswell, C. Kaplanis, D. Barrett, and M. Garnelo (2019) An explicitly relational neural network architecture. CoRR abs/1905.10307. External Links: Link, 1905.10307 Cited by: §1.
  • [19] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales (2017) Learning to compare: relation network for few-shot learning. CoRR abs/1711.06025. External Links: Link, 1711.06025 Cited by: §1.
  • [20] T. Wu, M. Tjandrasuwita, Z. Wu, X. Yang, K. Liu, R. Sosič, and J. Leskovec (2022) ZeroC: a neuro-symbolic model for zero-shot concept recognition and acquisition at inference time. Under review. Cited by: §2.
  • [21] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §3.1.

Appendix A Appendix


Figure 3: Example of each of the three global relation types of the BabyARC dataset.

Figure 4: Concepts: Examples of “Lshape” and “Rect”.

Figure 5: Example of Task 2 (2 core objects) with a red rectangle distractor object, and Task 0 (3 core objects) with a green distractor object. The core objects are part of the relational subgraph.

Figure 6: t-SNE visualization of learned relation embeddings, for the 2-3 core, 0-2 distractor dataset with the contrastive + IB objective. The colors represent ground truth labels, where label 0 is “none” (no relationship), 1 is “inside”, 2 is “same-color”, and 3 is “same-shape”.

A.1 Mutual Information Function

For continuous distributions with probability density functions, the mutual information is defined as:

I(X; Z) = \int\!\!\int p(x, z)\, \log \frac{p(x, z)}{p(x)\, p(z)}\; dx\, dz \tag{4}

A.2 Hyperparameters: Model Architecture and Training

The CNN object encoder is a 4-layer CNN, the MLP relation encoder is a 3-layer MLP, and the GIN graph encoder is a 2-layer GNN with a 3-layer MLP for each GNN layer. All activation functions are LeakyReLU. The model was optimized using the Adam optimizer [11] with momentum 0.9 and no weight decay. The latent dimension for objects is 100, and the latent dimension for relations is 20. The input is an image over the width-by-height grid, with channels for the 9 total colors and for the maximum number of possible objects in the dataset. The ground truth object masks are provided as input, as our work focuses on relation discovery rather than object discovery. The margin hyperparameter for the contrastive loss is set based on the relation latent dimension of 20. Each image is augmented with some probability, where the possible augmentations are random flipping, rotation, resizing, and color changes. Each loss term (contrastive, classification, and IB) is weighted by its own coefficient when used. In training, each task contains around 200-300 image examples.

A.3 Limitations

One current limitation is that the model only learns the relation representations necessary to distinguish between the given tasks. We observed this in earlier experiments, where we evaluated our method in a setting with a limited number of tasks, and it learned only two relation representations, as that was sufficient to separate the tasks: “inside” and a merged “same-shape”/“same-color” representation.

Our method currently also does not distinguish between the “none” relation (object pairs without a label) and the “same-color” relation. This can be seen in the t-SNE visualization in Fig. 6, where the “none” and “same-color” relation distributions overlap with each other.

A.4 Tasks in each Dataset

The notation is the following: [(0, 1), ’same-color’] represents that the relation ’same-color’ holds between the 0th and the 1st object. Only core objects are defined in the relation specifications, not distractor objects. Visual examples for each task are shown in Fig. 7.

For datasets containing 2-3 core objects, the tasks in the dataset are the following:

Task 0 (3 objects): [(0, 1), ’same-color’], [(1, 2), ’same-shape’]
Task 1 (2 objects): [(0, 1), ’same-color’]
Task 2 (2 objects): [(0, 1), ’inside’]
Task 3 (3 objects): [(0, 1), ’inside’], [(1, 2), ’same-color’]
Task 4 (3 objects): [(0, 1), ’inside’], [(1, 2), ’same-shape’]
Task 5 (3 objects): [(0, 1), ’same-color’], [(1, 2), ’same-color’]

For datasets containing 2-4 core objects, the tasks in the dataset are the following:

Task 0 (2 objects): [(0, 1), ’inside’]
Task 1 (3 objects): [(0, 1), ’inside’], [(1, 2), ’same-color’]
Task 2 (3 objects): [(0, 1), ’inside’], [(1, 2), ’same-shape’]
Task 3 (2 objects): [(0, 1), ’same-color’]
Task 4 (3 objects): [(0, 1), ’same-color’], [(1, 2), ’same-color’]
Task 5 (3 objects): [(0, 1), ’same-color’], [(1, 2), ’same-shape’]
Task 6 (4 objects): [(0, 2), ’same-color’], [(1, 2), ’same-color’], [(2, 3), ’same-shape’]
Task 7 (4 objects): [(0, 1), ’inside’], [(1, 2), ’same-color’], [(2, 3), ’same-color’]
Task 8 (4 objects): [(0, 2), ’same-shape’], [(1, 2), ’same-color’], [(2, 3), ’same-shape’]
Task 9 (4 objects): [(0, 1), ’inside’], [(1, 2), ’same-color’], [(2, 3), ’same-shape’]
Task 10 (4 objects): [(0, 1), ’inside’], [(1, 2), ’same-shape’], [(2, 3), ’same-shape’]
Task 11 (4 objects): [(0, 1), ’inside’], [(1, 2), ’same-shape’], [(2, 3), ’same-color’]
Task 12 (4 objects): [(0, 2), ’same-color’], [(1, 2), ’same-color’], [(2, 3), ’same-color’]

A.5 Task Maximum Common Subgraph Retrieval, with all seen tasks

The following shows the maximum common subgraph (MCS) retrievals for ViRel trained on the 2-4 core object, 0 distractor dataset with the contrastive + IB objective. The evaluation dataset is the same (2-4 core object, 0 distractor); thus, the relational structures of all tasks shown were seen by the model during training.

To obtain the shared relational graph structure for each task, we would ideally take the maximum common subgraph of all the constructed graphs of each task. In practice, however, the relation predictions are not perfect, and a single incorrect relation prediction would change the whole maximum common subgraph. To mitigate this, we compute the maximum common subgraph over subgroups of observations instead of over all observations at once, and then use a counting mechanism to identify the most frequently retrieved maximum common subgraph. The size of these subgroups is referred to as the “group size”; in the following results, the group size is 5. The notation for interpreting the results is the same as in Section A.4.
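The following is an illustrative simplification of this procedure (our own sketch, not the exact implementation): graphs are represented as dicts mapping object-index pairs to predicted relation labels, the per-subgroup MCS is approximated by folding a brute-force pairwise MCS, and the retrieved structures are tallied with a counter.

from collections import Counter
from itertools import permutations

def pairwise_mcs(g1, g2, num_nodes=4):
    """Brute-force largest common labeled edge set between two small relation graphs,
    each a dict {(i, j): relation} with i < j and node indices in 0..num_nodes-1."""
    best = {}
    for perm in permutations(range(num_nodes)):          # candidate g1-node -> g2-node correspondence
        common = {}
        for (i, j), rel in g1.items():
            a, b = sorted((perm[i], perm[j]))
            if g2.get((a, b)) == rel:
                common[(i, j)] = rel
        if len(common) > len(best):
            best = common
    return best

def most_common_mcs(task_graphs, group_size=5, top_k=3):
    """Approximate the MCS of each subgroup of predicted graphs, then count how often
    each retrieved structure appears and return the top_k structures."""
    counts = Counter()
    for start in range(0, len(task_graphs) - group_size + 1, group_size):
        group = task_graphs[start:start + group_size]
        mcs = group[0]
        for g in group[1:]:                              # fold pairwise MCS over the subgroup
            mcs = pairwise_mcs(mcs, g)
        counts[frozenset(mcs.items())] += 1
    return counts.most_common(top_k)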

Task 0:
Count: 420, MCS: [(0, 1), ’inside’]

Task 1:
Count: 287, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(1, 2), ’same-color’]
Count: 133, MCS: [(0, 1), ’inside’], [(1, 2), ’same-color’]

Task 2:
Count: 337, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(1, 2), ’same-shape’]
Count: 83, MCS: [(0, 1), ’inside’], [(1, 2), ’same-shape’]

Task 3:
Count: 420, MCS: [(0, 1), ’same-color’]

Task 4:
Count: 270, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(1, 2), ’same-color’]
Count: 150, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’]

Task 5:
Count: 305, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’], [(1, 2), ’same-color’]
Count: 115, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’]

Task 6:
Count: 159, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-shape’]
Count: 144, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(2, 3), ’same-shape’]
Count: 78, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(1, 2), ’same-color’], [(2, 3), ’same-shape’]

Task 7:
Count: 230, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 101, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 69, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’]

Task 8:
Count: 242, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’], [(0, 3), ’same-shape’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-shape’]
Count: 110, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’], [(0, 3), ’same-shape’], [(1, 2), ’same-color’], [(2, 3), ’same-shape’]
Count: 39, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’], [(0, 3), ’same-shape’], [(2, 3), ’same-shape’]

Task 9:
Count: 195, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-shape’]
Count: 112, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(2, 3), ’same-shape’]
Count: 97, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(2, 3), ’same-shape’]

Task 10:
Count: 253, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’], [(1, 3), ’same-shape’], [(2, 3), ’same-shape’]
Count: 117, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’], [(1, 3), ’same-shape’]
Count: 37, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’]

Task 11:
Count: 196, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 117, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’], [(1, 3), ’same-color’]
Count: 82, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’]

Task 12:
Count: 15, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 9, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 2, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’]

A.6 Task Maximum Common Subgraph Retrieval, with unseen tasks

The following shows the maximum common subgraph (MCS) retrievals for ViRel trained on the 2-3 core object, 1 distractor dataset with the contrastive + IB objective. The evaluation dataset is different: the 2-4 core object, 0 distractor dataset. As a result, the relational structures of Tasks 6 to 12 were not seen by the model during training.

Task 0:
Count: 420, MCS: [(0, 1), ’inside’]

Task 1:
Count: 270, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(1, 2), ’same-color’]
Count: 150, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’]

Task 2:
Count: 360, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(1, 2), ’same-shape’]
Count: 60, MCS: [(0, 1), ’inside’], [(1, 2), ’same-shape’]

Task 3:
Count: 420, MCS: [(0, 1), ’same-color’]

Task 4:
Count: 255, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(1, 2), ’same-color’]
Count: 165, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’]

Task 5:
Count: 214, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’], [(1, 2), ’same-color’]
Count: 206, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’]

Task 6:
Count: 148, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-shape’]
Count: 146, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’]
Count: 79, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’], [(1, 2), ’same-color’], [(1, 3), ’same-color’]

Task 7:
Count: 220, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(2, 3), ’same-color’]
Count: 116, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 67, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’]

Task 8:
Count: 135, MCS: [(0, 1), ’same-color’], [(0, 3), ’same-shape’], [(1, 2), ’same-color’], [(1, 3), ’same-color’]
Count: 120, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’], [(0, 3), ’same-shape’], [(1, 2), ’same-color’], [(1, 3), ’same-color’]
Count: 107, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-shape’], [(0, 3), ’same-shape’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-shape’]

Task 9:
Count: 137, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(2, 3), ’same-shape’]
Count: 128, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-shape’]
Count: 108, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(2, 3), ’same-shape’]

Task 10:
Count: 258, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’], [(1, 3), ’same-shape’], [(2, 3), ’same-shape’]
Count: 112, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(1, 2), ’same-shape’], [(1, 3), ’same-shape’], [(2, 3), ’same-shape’]
Count: 46, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’]

Task 11:
Count: 201, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 98, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’], [(1, 3), ’same-color’]
Count: 93, MCS: [(0, 1), ’inside’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-shape’]

Task 12:
Count: 16, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 9, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(1, 2), ’same-color’], [(1, 3), ’same-color’], [(2, 3), ’same-color’]
Count: 2, MCS: [(0, 1), ’same-color’], [(0, 2), ’same-color’], [(0, 3), ’same-color’], [(2, 3), ’same-color’]

A.7 Additional BabyARC examples

Figure 7: BabyARC: The images above depict 3 examples for each task. While only 3 are shown for each task, we use 200-300 examples per task during training.

A.8 Additional ARC example tasks

Figure 8: Example of an ARC task. The first column is the input and the second column is the output; the first two rows are independent training examples. The last row is the test example for this task, where only the input image is given. In this task, the objective is to color all the cyan objects with the same pattern as the non-cyan object, and then remove the original non-cyan object; this procedure produces the output from the input. All the training examples share a common relational graph structure, between the inputs and between the outputs. In the inputs, all the objects hold the relation “same-shape”, and three of the objects hold the relation “same-color”. In the outputs, all three objects hold the relations “same-shape” and “same-color”. The ability to learn and identify relations in an unsupervised manner is essential, as this task requires understanding of both the “same-shape” and “same-color” relations.

Figure 9: Another example of an ARC task. See the caption of Fig. 8 for the general task description. In this task, the objective is to count the distinct objects by shape and color, and output the object with the largest count. In the inputs, the relational graph is the following: various groups of objects share the relations “same-color” and “same-shape” with each other.

Figure 10: Another example of an ARC task. See the caption of Fig. 8 for the general task description. In this task, the objective is to identify the object with blue and green colors, and copy that object onto the other shapes by aligning the red and yellow pixels. In the inputs, the relational graph is the following: all but one of the objects hold the relations “same-shape” and “same-color”. In the outputs, all of the objects hold the relations “same-shape” and “same-color”.