An Explicitly Relational Neural Network Architecture

05/24/2019 ∙ by Murray Shanahan, et al.

With a view to bridging the gap between deep learning and symbolic AI, we present a novel end-to-end neural network architecture that learns to form propositional representations with an explicitly relational structure from raw pixel data. In order to evaluate and analyse the architecture, we introduce a family of simple visual relational reasoning tasks of varying complexity. We show that the proposed architecture, when pre-trained on a curriculum of such tasks, learns to generate reusable representations that better facilitate subsequent learning on previously unseen tasks when compared to a number of baseline architectures. The workings of a successfully trained model are visualised to shed some light on how the architecture functions.


1 Introduction

When humans face novel problems, they are able to draw effectively on past experience with other problems that are superficially very different, but that have similarities on a more abstract, structural level. This ability is essential for lifelong, continual learning, and confers on humans a degree of data efficiency, powers of transfer learning, and a capacity for out-of-distribution generalisation that contemporary machine learning has yet to match [garnelo2016towards; lake2017building; marcus2018deep]. A case may be made that all these issues are different facets of the same underlying challenge, namely the challenge of devising systems that learn to construct general-purpose, reusable representations [mccarthy1987generality; bengio2013representation]. A representation is general-purpose and reusable to the extent that it contains information whose domain of application exceeds the context within which it was acquired.

Representations that are general-purpose and reusable improve data efficiency because a system that already knows how to build representations relevant to a novel task (despite its novelty) doesn’t have to learn that task from scratch. Ideally, a system that efficiently exploits general-purpose, reusable representations in this way should be the very same system that learned how to construct them in the first place. Moreover, in learning to solve a novel task using such representations, we should expect the system to learn further representations that are themselves general-purpose and reusable. So, with the exception of the very first representations the system learns, all learning in such a system would in effect be transfer learning, and the process of learning would be inherently cumulative, continual, and lifelong.

One approach to building such a system is to take inspiration from the paradigm of classical, symbolic AI [garnelo2019reconciling]. Building on the mathematical foundations of first-order predicate calculus, a typical symbolic AI system works by applying logic-like rules of inference to language-like propositional representations whose elements are objects and relations. Thanks to their declarative character and compositional structure, these representations lend themselves naturally to generality and reusability. However, in contrast to contemporary deep learning systems, the representations deployed in classical AI are not usually learned from data but hand-crafted [harnad1990symbol]. The aim of the present work is to get the best of both worlds with an end-to-end differentiable neural network architecture that builds in propositional, relational priors in much the same way that a convolutional network builds in spatial and locality priors.

The architecture introduced here builds on recent work with non-local network architectures that learn to discover and exploit relational information [wang2018nonlocal], notably relation nets [santoro2017simple; palm2018recurrent] and architectures based on multi-head attention [vaswani2017attention; zambaldi2019deep]. However, these architectures generate representations that lack explicit structure. There is, in general, no straightforward mapping from the parts of a representation to the usual elements of a symbolic medium such as predicate calculus: propositions, relations, and objects. To the extent that these elements are present, they are smeared across the embedding vector, which makes them hard to interpret and can make it difficult for downstream processing to take advantage of compositionality.

Here we present an architecture, which we call a PrediNet, that learns representations whose parts map directly onto propositions, relations, and objects. To build a sound scientific understanding of the proposed architecture, and to facilitate a detailed comparison with other architectures, the present study focuses on simple tasks requiring relatively little data and computation. We develop a family of small, simple visual datasets that can be combined into a variety of multi-task curricula and used to assess the extent to which an architecture learns representations that are general-purpose and reusable. We report the results of a number of experiments using these datasets that demonstrate the potential of an explicitly relational network architecture to improve data efficiency and generalisation, to facilitate transfer, and to learn reusable representations.

2 The PrediNet Architecture

The idea that propositions are the building blocks of knowledge dates back to the ancient Greeks, and provides the foundation for symbolic AI, via the nineteenth-century mathematical work of Boole and Frege [russell2009artificial]. An elementary proposition asserts that a relationship holds between a set of objects. Propositions can be combined using logical connectives (and, or, not, etc.), and can participate in inference processes such as deduction. The task of the PrediNet is to (learn to) transform high-dimensional data such as images into propositional representations that are useful for downstream processing. A PrediNet module (Fig. 1) can be thought of as a pipeline comprising three stages: attention, binding, and evaluation. The attention stage selects pairs of objects of interest, the binding stage instantiates the first two arguments of a set of three-place predicates (relations) with the selected object pairs, and the evaluation stage computes values for each predicate's remaining (scalar) argument such that the resulting proposition is true.

Figure 1: The PrediNet architecture. The key and relation weights W_K and W_S are shared across heads, whereas the query weights W_Q1^h and W_Q2^h are local to each head. See main text for more details.

More precisely, a PrediNet module comprises k heads, each of which computes j relations between a pair of objects. For a given input L (a matrix whose rows are feature vectors, each with its xy position appended), every head computes the same set of j relations (using the shared weights W_S) but selects a different pair of objects, using dot-product attention based on key-query matching [vaswani2017attention]. Each head h computes its own pair of queries Q1^h and Q2^h (via W_Q1^h and W_Q2^h), but the key space (defined by W_K) is shared between heads.

Applying the resulting pair of attention masks A1^h and A2^h directly to L yields a pair of objects E1^h and E2^h, each represented by a weighted sum of feature vectors.

All j relations between E1^h and E2^h are then evaluated. There are many ways to compute a relationship between a pair of objects represented as vectors. In the current architecture, E1^h and E2^h are subject to a linear mapping (via W_S) into j separate 1D spaces, one per relation, and the resulting pair of vectors is passed through an element-wise comparator (subtraction), yielding a vector of differences D^h = E1^h W_S − E2^h W_S.

The last two elements of E1^h and E2^h (the positions p1^h and p2^h, respectively) are concatenated to the vector of differences to give the head's output R^h. Finally, the outputs of all k heads are concatenated, yielding the output of the PrediNet module, a vector of length k(j+4). In predicate calculus terms, the final output of a PrediNet module with k heads and j relations represents the conjunction of elementary propositions

⋀_{h=1..k} ⋀_{i=1..j} ψ_i(e*_1^h, e*_2^h, d_i^h)    (1)

where ψ_i asserts that d_i^h is the distance between objects e*_1^h and e*_2^h in the 1D space defined by column i of the weight matrix W_S, and the denotations of e*_1^h and e*_2^h are captured by the query vectors Q1^h and Q2^h respectively, given the key space defined by W_K.
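For concreteness, the following NumPy sketch traces the data flow just described for a single PrediNet forward pass: shared keys, per-head query pairs, soft attention over the input feature vectors, a shared linear mapping into j one-dimensional relation spaces, element-wise comparison, and concatenation of the per-head outputs. The weight shapes, the use of the flattened input to form the queries, and the output ordering are assumptions of this illustrative sketch, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predinet_forward(L, W_k, W_q1, W_q2, W_s):
    """One forward pass of a PrediNet module (illustrative sketch).

    L    : (n, m) matrix of CNN feature vectors; the last two elements of
           each row are assumed to be that vector's xy position.
    W_k  : (m, d)       key weights, shared across heads.
    W_q1 : (k, n*m, d)  per-head query weights for the first object.
    W_q2 : (k, n*m, d)  per-head query weights for the second object.
    W_s  : (m, j)       relation (evaluation) weights, shared across heads.
    Returns a vector of length k * (j + 4).
    """
    K = L @ W_k                       # one key per input feature vector: (n, d)
    flat = L.reshape(-1)              # flattened input, used to form the queries
    head_outputs = []
    for h in range(W_q1.shape[0]):
        # Attention: each head selects two "objects" by key-query matching.
        q1, q2 = flat @ W_q1[h], flat @ W_q2[h]
        a1, a2 = softmax(K @ q1), softmax(K @ q2)   # masks over the n vectors
        e1, e2 = a1 @ L, a2 @ L       # objects as attention-weighted sums: (m,)
        # Evaluation: project both objects into j 1D spaces and compare.
        d = e1 @ W_s - e2 @ W_s       # vector of j differences
        # Binding of positions: append the xy coordinates of both objects.
        head_outputs.append(np.concatenate([d, e1[-2:], e2[-2:]]))
    return np.concatenate(head_outputs)             # length k * (j + 4)

# Smoke test with arbitrary sizes (n=9 feature vectors, k=4 heads, j=8 relations).
rng = np.random.default_rng(0)
n, m, dk, k, j = 9, 34, 16, 4, 8
out = predinet_forward(rng.normal(size=(n, m)),
                       rng.normal(size=(m, dk)),
                       rng.normal(size=(k, n * m, dk)),
                       rng.normal(size=(k, n * m, dk)),
                       rng.normal(size=(m, j)))
print(out.shape)                      # (48,) == (k * (j + 4),)
```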

3 Datasets and Tasks

It would be premature to apply the PrediNet architecture to rich, complex data before we have a basic understanding of its properties and its behaviour. To facilitate in-depth scientific study, we need small, simple datasets that allow the operation of the architecture to be examined in detail and the fundamental premises of its design to be assessed. Our experimental goals in the present paper are 1) to test the hypothesis that the PrediNet architecture learns representations that are general-purpose and reusable, and 2) insofar as this is true, to investigate why. To do this, we devised a configurable family of simple classification tasks that we collectively call the Relations Game.

Figure 2: Relations Game object sets and tasks. (a) Example objects from the training set and held-out test set. (b) There are five possible row / column patterns. In a multi-task setting, recognising each row pattern is a separate task. (c) Three example tasks for the single-task setting. (d) An example target task (left) and curriculum (right) for the multi-task setting.

A Relations Game task involves the presentation of an image containing a number of objects laid out on a 3x3 grid, and the aim (in most tasks) is to label the image as True or False according to whether a given relation holds among the objects in the image. While the elementary propositions learned by the PrediNet only assert simple relationships between pairs of entities, Relations Game tasks generally involve learning compound relations involving multiple relationships among many objects. The objects in question are drawn from either a training set or one of two held-out sets (Fig. 2a). None of the shapes or colours in the training set occurs in either of the held-out sets. The training object set contains 8 uniformly coloured pentominoes and their rotations and reflections (37 shapes in all) with 25 possible colours. The first held-out object set contains 8 uniformly coloured hexominoes and their rotations and reflections (46 shapes in all) with 25 possible colours, and the second held-out object set contains only squares, but with a striped pattern of held-out colours.

Each Relations Game task is tied to a given relation. Even with such a simple setup, the number of definable relations among all possible combinations of objects is astronomical, although only a few of them will make intuitive sense. For the present study, we defined a handful of intuitively meaningful relations and generated corresponding labelled datasets comprising 50% positive and 50% negative examples. A selection is shown in Fig. 2c. The 'between' relation holds iff the image contains three objects in a line in which the outer two objects have the same shape, orientation, and colour. The 'occurs' relation holds iff there is an object in the bottom row of three objects that has the same shape, orientation, and colour as the (single) object in the top row. The 'same' relation holds iff the image contains two objects of the same shape, orientation, and colour. In each case, we balanced the set of negative examples to ensure that "tricky" images involving pairs of objects with the same colour but different shape, or the same shape but different colour, occur just as frequently as those with objects that differ in both colour and shape.
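The balancing scheme described above can be illustrated with a small sketch that samples simplified 'same'-task examples on a 3x3 grid. The shape and colour vocabularies, the absence of image rendering, and the exact mix of "tricky" negatives are placeholders here, not the datasets used in the experiments.

```python
import random

SHAPES = [f"shape_{i}" for i in range(37)]   # placeholder shape identifiers
COLOURS = list(range(25))                    # placeholder colour identifiers

def sample_same_example(positive, tricky_prob=0.5):
    """Sample one simplified 'same'-task example.

    Returns (objects, label), where objects is a list of
    (grid_cell, shape, colour) triples on a 3x3 grid and label is True
    iff the two objects share both shape and colour.  Negative examples
    are balanced so that "tricky" near-misses (one attribute shared)
    occur as often as easy ones, mirroring the balancing described above.
    """
    cells = random.sample(range(9), 2)           # two distinct grid cells
    s1, c1 = random.choice(SHAPES), random.choice(COLOURS)
    if positive:
        s2, c2 = s1, c1
    elif random.random() < tricky_prob:
        if random.random() < 0.5:                # same shape, different colour
            s2, c2 = s1, random.choice([c for c in COLOURS if c != c1])
        else:                                    # same colour, different shape
            s2, c2 = random.choice([s for s in SHAPES if s != s1]), c1
    else:                                        # differ in both attributes
        s2 = random.choice([s for s in SHAPES if s != s1])
        c2 = random.choice([c for c in COLOURS if c != c1])
    return [(cells[0], s1, c1), (cells[1], s2, c2)], positive

# 50% positive / 50% negative, as in the datasets described above.
dataset = [sample_same_example(positive=(i % 2 == 0)) for i in range(1000)]
```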

4 Experimental Setup

At the top level, each architecture we consider in this paper comprises 1) a single convolutional input layer (CNN), 2) a central module (which might be a PrediNet or a baseline), and 3) a small output multi-layer perceptron (MLP) (Fig. 1b). A pair of xy co-ordinates is appended to each CNN feature vector, denoting its position in convolved image space, and, where applicable, a one-hot task identifier is appended to the output of the central module. For most tasks, the final output of the MLP is a one-hot label denoting True or False. The PrediNet was evaluated by comparing it to several baselines: two MLP baselines (MLP1 and MLP2), a relation net baseline (RN) [santoro2017simple], and a multi-head attention baseline (MHA) [vaswani2017attention; zambaldi2019deep].
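As an illustration of the input plumbing described above, the following sketch appends a pair of coordinates to each CNN feature vector and a one-hot task identifier to the central module's output. The coordinate normalisation to [-1, 1] and the function names are choices of this sketch, not details taken from the paper.

```python
import numpy as np

def add_xy_coordinates(feature_map):
    """Append (x, y) coordinates to each CNN feature vector.

    feature_map : (H, W, C) output of the input CNN.
    Returns an (H*W, C+2) matrix of feature vectors, the form of input
    consumed by the central module (PrediNet or baseline).
    """
    H, W, C = feature_map.shape
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, H),
                         np.linspace(-1.0, 1.0, W), indexing="ij")
    with_pos = np.concatenate(
        [feature_map, xs[..., None], ys[..., None]], axis=-1)
    return with_pos.reshape(H * W, C + 2)

def append_task_id(central_output, task_id, num_tasks):
    """Concatenate a one-hot task identifier to the central module output."""
    one_hot = np.zeros(num_tasks)
    one_hot[task_id] = 1.0
    return np.concatenate([central_output, one_hot])

# Example: a 5x5x32 feature map becomes 25 vectors of length 34.
print(add_xy_coordinates(np.zeros((5, 5, 32))).shape)   # (25, 34)
```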

To facilitate a fair comparison, the top-level schematic is identical for the PrediNet and for all baselines (Fig. 3). All use the same input CNN architecture and the same output MLP architecture, and differ only in the central module. In MLP1, the central module is a single fully-connected layer with ReLU activations, while MLP2's has two layers. In RN, the central module computes the set of all possible pairs of feature vectors, each of which is passed through a 2-layer MLP; the resulting vectors are then aggregated by taking their element-wise means to yield the output vector. Finally, MHA comprises multiple heads, each of which generates mappings from the input feature vectors to sets of keys K^h, queries Q^h, and values V^h, and then computes softmax(Q^h (K^h)^T) V^h. Each head's output is a weighted sum of the resulting vectors, and the output of the MHA central module is the concatenation of all its heads' outputs. The PrediNet used here comprises k heads and j relations (Fig. 1), with the default values of k and j given in the Supplementary Material. All reported experiments were carried out using stochastic gradient descent, and all results shown are averages over 10 runs. Further experimental details are given in the Supplementary Material, which also shows results for experiments with different numbers of heads and relations, and with the Adam optimiser, all of which present qualitatively similar results.
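For comparison with the PrediNet sketch above, a single head of the MHA baseline's central module might look as follows, using standard dot-product attention as described in the text; how the attended value vectors are aggregated into one output vector per head is an assumption here (a plain mean stands in for the weighted sum).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha_head(L, W_k, W_q, W_v):
    """One head of the multi-head attention (MHA) baseline's central module.

    L : (n, m) matrix of input feature vectors.  Keys, queries, and values
    are linear mappings of L; the attended values are aggregated into a
    single vector per head.
    """
    K, Q, V = L @ W_k, L @ W_q, L @ W_v   # (n, d_k), (n, d_k), (n, d_v)
    A = softmax(Q @ K.T)                  # (n, n) attention weights
    attended = A @ V                      # (n, d_v) attended value vectors
    return attended.mean(axis=0)          # (d_v,) head output

# The MHA central module output is the concatenation of all head outputs.
```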

Figure 3: The four-stage experimental protocol for multi-task curriculum training.

To assess the generality and reusability of the representations produced by the PrediNet, we adopted a four-stage experimental protocol wherein 1) the network is pre-trained on a curriculum of one or more tasks, 2) the weights in the input CNN and PrediNet are frozen while the weights in the output MLP are re-initialised with random values, 3) the network is retrained on a new target task or set of tasks, and 4) as a control, the same target-task training is repeated with the central module's weights also re-initialised, so that only the input CNN retains its pre-trained weights (Fig. 3). In step 3, only the weights in the output MLP change, so the target task can only be learned to the extent that the PrediNet delivers re-usable representations to it, representations the PrediNet has learned to produce without exposure to the target task. To assess this, we can compare the learning curves for the target task with and without pre-training. We expect pre-training to improve data efficiency, so we should see accuracy increasing more quickly with pre-training than without it. For evidence of transfer, and to confirm the hypothesis of reusability, we are also interested in the final performance on the target task after pre-training, given that the weights of the pre-trained input CNN and PrediNet are frozen. This measure indicates how well a network has learned to form useful representations. The more different the target task is from the pre-training curriculum, the more impressed we should be that the network is able to learn the target task.
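The freezing and re-initialisation steps of the protocol are simple to express in code. The sketch below assumes a PyTorch model with submodules named cnn, central, and mlp (hypothetical names, not the authors' code): the first two are frozen, the output MLP is re-initialised, and the optimiser is given only the MLP's parameters, as in Stage 3.

```python
from torch import nn, optim

def prepare_stage3(model, lr=1e-2):
    """Stages 2-3 of the protocol: freeze the input CNN and the central
    module, re-initialise the output MLP, and return an optimiser that
    updates only the MLP.  `model.cnn`, `model.central`, and `model.mlp`
    are hypothetical submodule names standing in for the real code."""
    for module in (model.cnn, model.central):
        for p in module.parameters():
            p.requires_grad = False          # frozen: no longer trained
    for layer in model.mlp.modules():
        if isinstance(layer, nn.Linear):
            layer.reset_parameters()         # fresh random weights
    return optim.SGD(model.mlp.parameters(), lr=lr)
```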

5 Results

As a prelude to investigating the issues of generality and reusability, we studied the data efficiency of the PrediNet architecture in a single-task Relations Game setting. Results obtained on a selection of five tasks – 'same', 'between', 'occurs', 'xoccurs', and 'colour / shape' – are summarised in Table 1. The first three tasks are as described in Fig. 2. The 'xoccurs' relation is similar to 'occurs'; it holds iff the object in the top row occurs in the bottom row and the other two objects in the bottom row are different. The 'colour / shape' task involves four labels, rather than the usual two: same-shape / same-colour; different-colour / same-shape; same-colour / different-shape; different-colour / different-shape. In the dataset for this task, each image contains two randomly placed objects, and one of the four labels must be assigned appropriately. Table 1 shows the accuracy obtained by each of the five architectures when tested on the two held-out object sets after 100,000 batches of training. The PrediNet is the only architecture that achieves over 90% accuracy on all tasks with both held-out object sets. On the 'xoccurs' task, the PrediNet out-performs the baselines by more than 10%, and on the 'colour / shape' task (where chance is 25%), it out-performs all the baselines except MHA by 25% or more.

Table 1: Data efficiency in a single-task Relations Game setting.

Next, using the protocol outlined in Fig. 3, we compared the PrediNet's ability to learn re-usable representations with that of each of the baselines. We looked at a number of combinations of target tasks and pre-training curriculum tasks. Fig. 4 depicts our findings for one of these combinations in detail, specifically three target tasks corresponding to three of the five possible column patterns (ABA, AAB, and ABB (Fig. 2d)), and a pre-training curriculum comprising the single 'between' task. The plots present learning curves for each of the five architectures at each of the four stages of the experimental protocol. In all cases, accuracy is shown for the 'stripes' held-out object set (not the training set). Of particular interest are the (green) curves corresponding to Stage 3 of the experimental protocol. These show how well each architecture learns the target task(s) after the central module has been pre-trained on the curriculum task(s) and its weights are frozen. The PrediNet learns faster than any of the baselines, and is the only one to achieve an accuracy of 90%. The rapid reusability of the representations learned by both the MHA baseline and the PrediNet is noteworthy because the 'between' relation by itself seems an unpromising curriculum for subsequently learning the AAB and ABB column patterns. As the (red) curve for Stage 4 of the protocol shows, the reusability of the PrediNet's representations cannot be accounted for by the pre-training of the input CNN alone.

Figure 4: Multi-task curriculum training. The target tasks are three column patterns (AAB, ABA, and ABB) and the sole curriculum task is the ‘between’ relation.
Figure 5: Reusability of representations learned with a variety of target and pre-training tasks

Fig. 5 shows a larger range of target task / curriculum task combinations, concentrating exclusively on the Stage 3 learning curves. Here a more complete picture emerges. In both Fig. 5a and Fig. 5d the target task is 'match rows' (Fig. 2d), but they differ in their pre-training curricula. The curriculum for Fig. 5d is three of the five row patterns (ABA, AAB, and ABB). This is the only case where the PrediNet does not learn representations that are more useful for the target task than those of all the baselines, outperforming only two of the four. However, when the curriculum is the three analogous column patterns rather than row patterns, the performance of all four baselines collapses to chance, while the PrediNet does well, attaining performance similar to that for the row-based curriculum (Fig. 5a). This suggests the PrediNet is able to learn representations that are orientation invariant, which aids transfer. This hypothesis is supported by Fig. 5e, where the target tasks are all five row patterns, while the curriculum is all five column patterns. None of the baselines is able to learn reusable representations in this context; all remain at chance, whereas the PrediNet achieves 85% accuracy.

To better understand the operation of the PrediNet, we carried out a number of visualisations. One way to find out what the PrediNet's heads learn to attend to is to submit images to a trained network and, for each head h, apply the two attention masks A1^h and A2^h to each of the feature vectors in the convolved image L. The resulting matrix can then be plotted as a heat map to show how attention is distributed over the image. We did this for a number of networks trained in the single-task setting. Fig. 6a shows two examples, and the Supplementary Material contains a more extensive selection. As we might expect, most of the attention focuses on the centres of single objects, and many of the heads pick out pairs of distinct objects in various combinations. But some heads attend to halves or corners of objects. Although most attention is focal, whether directed at object centres or object parts, some heads exhibit diffuse attention, which is possible thanks to the soft key-query matching mechanism. So the PrediNet can (but isn't forced to) treat the background as a single entity, or treat an identical pair of objects as a single entity.
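A minimal version of this visualisation reshapes a head's attention weights onto the convolved-image grid and renders them as a heat map; treating each mask as a distribution over an H x W grid of feature vectors, and showing the two masks side by side, are assumptions of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_head_attention(a1, a2, grid_hw):
    """Display a head's two attention masks over the convolved image.

    a1, a2  : attention weights over the n = H * W feature vectors
              (one non-negative value per spatial location, summing to 1).
    grid_hw : (H, W) spatial shape of the convolved image.
    """
    H, W = grid_hw
    fig, axes = plt.subplots(1, 2, figsize=(6, 3))
    for ax, mask, title in zip(axes, (a1, a2), ("mask A1", "mask A2")):
        ax.imshow(np.asarray(mask).reshape(H, W), cmap="viridis")
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```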

Figure 6: (a) Attention heat maps for the first four heads of a trained PrediNet. Top: trained on the 'same' task. Bottom: trained on the 'occurs' task. (b) Principal component analysis. Top: PCA on the output of a selected head for a PrediNet trained on the 'colour / shape' task. Top left: for pentominoes images (training set). Top right: for hexominoes (held-out test set). Bottom left: the same analysis applied to a representative head of the MHA baseline with pentominoes (training set).

To gain some insight into how the PrediNet encodes relations, we carried out principal component analysis (PCA) on the per-head outputs of the central module for a number of trained networks, again in the single-task setting (Fig. 6b). We chose the four-label colour / shape task to train on, and mapped 10,000 example images onto the first two principal components, colouring each according to its ground-truth label. We found that, for some heads, differences in colour and shape appear to align along separate axes (Fig. 6b). This contrasts with the MHA baseline, whose heads don't seem to individually cluster the labels in a meaningful way. For the other baselines, which lack the multi-head organisation of the PrediNet and the MHA network, the only option is to carry out PCA on the whole output vector of the central module. Doing this, however, does not produce interpretable results for any of the architectures (Fig. S7). We also identified the heads in the PrediNet that attended to both objects in the image and found that they overlapped almost entirely with those that meaningfully clustered the labels (Fig. S9). In a final analysis, we found that the PrediNet was significantly more robust than the MHA to pruning a random subset of heads at test time – and if pruned to leave only those that attended to the two objects, the performance of the full network could be matched with just a handful of heads (Fig. S10). Taken together, these results are suggestive of something we might term relational disentangling in the PrediNet.
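The per-head analysis can be reproduced along the following lines, running a separate two-component PCA on each head's slice of the central module's output; the (num_images, k, j + 4) layout assumed here is illustrative.

```python
from sklearn.decomposition import PCA

def per_head_pca(head_outputs):
    """Run a separate 2-component PCA on each head's output.

    head_outputs : (num_images, k, head_dim) array holding each head's
                   slice of the central module output for every image.
    Returns a list of (num_images, 2) projections, one per head, which
    can then be scatter-plotted and coloured by ground-truth label.
    """
    return [PCA(n_components=2).fit_transform(head_outputs[:, h, :])
            for h in range(head_outputs.shape[1])]
```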

6 Related Work

The need for good representations has long been recognised in AI [mccarthy1987generality; russell2009artificial], and is fundamental to deep learning [bengio2013representation]. The importance of reusability and abstraction, especially in the context of transfer, is emphasised by Bengio et al. [bengio2013representation], who argue for feature sets that are "invariant to the irrelevant features and disentangle the relevant features". Our work here shares this motivation. Other work has looked at learning representations that are disentangled at the feature level [higgins2017beta; higgins2018scan]. The novelty of the PrediNet is to incorporate architectural priors that favour representations that are disentangled at the relational and propositional levels. Previous work with relation nets and multi-head attention nets has shown how non-local information can be extracted from raw pixel data and used to solve tasks that require relational reasoning [santoro2017simple; palm2018recurrent; zambaldi2019deep]. But unlike the PrediNet, these networks don't produce representations with an explicitly relational, propositional structure. By addressing the problem of acquiring structured representations, the PrediNet complements another thread of related work, which is concerned with learning how to carry out inference with structured representations, but which assumes the job of acquiring those representations is done elsewhere [battaglia2016interaction; rocktaschel2017end; evans2018learning].

In part, the present work is motivated by the conviction that curricula will be essential to lifelong, continual learning in a future generation of RL agents if they are to exhibit more general intelligence, just as they are for human children. Curricular pre-training has a decade-long pedigree in deep learning [bengio2009curriculum]. Closely related to curriculum learning is the topic of transfer [bengio2012deep], a hallmark of general intelligence and the subject of much recent attention [higgins2017darla; kansky2017schema; schwarz2018progress]. The PrediNet exemplifies a different (though not incompatible) viewpoint on curriculum learning and transfer from that usually found in the neural network literature. Rather than (or as well as) a means to guide the network, step by step, into a favourable portion of weight space, curriculum learning is here viewed in terms of the incremental accumulation of propositional knowledge. This necessitates the development of a different style of architecture, one that supports the acquisition of propositional, relational representations, which also naturally subserve transfer.

Asai [asai2019unsupervised], whose paper was published while the present work was in progress, describes an architecture with some similarities to the PrediNet, but also some notable differences. For example, Asai's architecture assumes an input representation in symbolic form where the objects have already been segmented. By contrast, in the present architecture, the input CNN and the PrediNet's dot-product attention mechanism together learn what constitutes an object.

7 Conclusion and Further Work

We have presented a neural network architecture capable, in principle, of supporting predicate logic's powers of abstraction without compromising the ideal of end-to-end learning, where the network itself discovers objects and relations in the raw data and thus avoids the symbol grounding problem entailed by symbolic AI's practice of hand-crafting representations [harnad1990symbol]. Our empirical results support the view that a network architecturally constrained to learn explicitly propositional, relational representations will have beneficial data efficiency, generalisation, and transfer properties. But the findings reported here are just the first foray into unexplored architectural territory, and much work needs to be done to gauge the architecture's full potential.

The focus of the present paper is the acquisition of propositional representations rather than their use. But thanks to the structural priors of its architecture, representations generated by a PrediNet module have a natural semantics compatible with predicate calculus (Equation 1), which makes them an ideal medium for logic-like downstream processes such as rule-based deduction, causal or counterfactual reasoning, and inference to the best explanation (abduction). One approach here would be to stack PrediNet modules and / or make them recurrent, enabling them to carry out the sort of iterated, sequential computations required for such processes [palm2018recurrent; dehghani2018universal]. Another worthwhile direction for further research would be to develop reinforcement learning (RL) agents using the PrediNet architecture. One form of inference of particular interest in this context is model-based prediction, which can be used to endow an RL agent with look-ahead and planning abilities [racaniere2017imagination; zambaldi2019deep]. Our expectation is that RL agents in which explicitly propositional, relational representations underpin these capacities will manifest more of the beneficial data efficiency, generalisation, and transfer properties suggested by the present results. As a stepping stone to such RL agents, the Relations Game family of datasets could be extended into the temporal domain, and multi-task curricula developed to encourage the acquisition of temporal, as well as spatial, abstractions.

Acknowledgments

Thanks to our DeepMind colleagues, especially Irina Higgins, Neil Rabinowitz, David Reichert, David Raposo, Adam Santoro, and Daniel Zoran.

Appendix S1 Hyperparameters

Table S2 shows the default hyperparameters used for the experiments reported in the main text.

Parameter Value
Input images size
size
Runs per experiment
Optimiser Gradient descent
Learning rate
Batch size
Input CNN output channels
Input CNN filter size

Input CNN stride

Input CNN activation ReLU
Bias Yes
Output MLP hidden layer size
Output MLP output size ()
Output MLP activation ReLU
Bias Yes (both)
MLP1 output size
MLP1 activation ReLU
Bias Yes
MLP2 hidden layer size
MLP2 activations ReLU
MLP2 output size
Bias Yes (both)
RN MLP hidden layer size (pre-aggregation)
RN output size
RN activation ReLU
RN aggregation Element-wise mean
Bias No
MHA no. of heads
MHA key / query size
MHA value size
MHA output size
MHA attention mechanism
Bias No
PrediNet no. of heads
PrediNet key / query size
PrediNet relations
PrediNet output size
Bias n/a
Table S2: Default hyperparameters

Appendix S2 Supplementary Analysis

S2.1 Dimensionality reduction on intermediate representations

In order to qualitatively assess the nature of the representations of each architecture, we performed a dimensionality reduction analysis on the outputs of the central layer of a number of model architectures trained on the colour/shape task. After training, a batch of 10000 images (pentominoes) was passed through the network and PCA was performed on the resulting representations, which were then projected onto the two largest PCs for visualisation. The projected representations were then colour-coded by the labels for the corresponding images (i.e. different/same shape, different/same colour).

PCA on the full representations (concatenating the head outputs in the case of the PrediNet and MHA models) did not yield any clear clustering of representations according to the labels for any of the models (Figure S7).

For the PrediNet and MHA models, we also ran separate PCAs on the output relations of each head in order to see how distributed / disentangled the representations were across heads. While in the MHA model there was no evidence of clustering by label on any of the heads, reflecting a heavily distributed representation, there were several heads in the PrediNet architecture that individually clustered the different labels (Figure S8). In some heads, colour and shape seemed to be projected along separate axes (e.g. heads 1, 5, 10, 32), while in others, objects with different colours seemed to be organised in a hexagonal grid (e.g. head 28).

It was also comforting to note that the clustering was preserved (though slightly compressed in PC space) when the held-out set of images (hexominoes) was passed through the PrediNet and projected onto the same principal components derived using the training set.

In Section S2.2, we show that the heads of the PrediNet that seem to cluster the labels also attend to the two objects in the image. Furthermore, in Section S2.3, we show that, at test time, the PrediNet can be pruned down to just a handful of these particular heads and exhibit performance comparable to the full network.

Figure S7: Central module outputs for networks trained on the colour/shape task when projected onto the two largest principal components. Panels: MLP1, MLP2, Relation Net, MHA, PrediNet.
Figure S8: Per-head PCA on the heads of a PrediNet and an MHA trained on the colour/shape task. Panels: (a) PrediNet, training set; (b) PrediNet, test set; (c) MHA, training set. In all networks, the PCA was performed using the training data (pentominoes). In (a) and (c), the training data are projected onto the two largest PCs, and in (b) the test data (hexominoes) were used.

S2.2 Attention analysis

In order to assess the extent to which the various PrediNet heads are attending to objects or the background, we produced a lower-resolution content mask for each image (with the same resolution as the attention mask), which contained 0s at locations where there are no objects in the corresponding pixels of the full image, 1s where more than a threshold proportion of the pixels contain an object, and intermediate values otherwise. By applying the attention mask to the content mask and summing the resulting elements, we produced a scalar indicating whether the attention mask was selecting a region of the image with an object (value close to 1), or the background (value close to 0). This was tested over images from the training set (Fig. S9), but similar results are obtained if the held-out set images are used instead. The top plot in Fig. S9 shows that both attention masks of some heads consistently attend to objects, while others attend to a combination of object and background. Importantly, the heads for which the PCA meaningfully clusters the labels are also the ones in which both attention masks attend to objects (Fig. S8).
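The object-versus-background score described above reduces to a weighted sum; in the sketch below, the exact thresholding used to build the content mask is left as a placeholder.

```python
def objectness_score(attention_mask, content_mask):
    """Score how much an attention mask falls on objects vs. background.

    attention_mask : (H, W) array of attention weights summing to 1.
    content_mask   : (H, W) array with 1 where the corresponding image
                     region mostly contains an object, 0 where it is
                     background, and intermediate values in between
                     (the thresholding scheme is a placeholder).
    Returns a scalar close to 1 when attention falls on objects and
    close to 0 when it falls on the background.
    """
    return float((attention_mask * content_mask).sum())
```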

We additionally provide a similar analysis with a position mask, where each pixel in the mask contains a unique location index. The middle plot in Fig. S9 shows that the attention masks in the majority of the heads do not consistently attend to specific locations. Finally, the mean absolute values of the per-head input weights to the output MLP are shown in the bottom plot of the same figure. Interestingly, the heads that consistently attend only to objects are given higher weighting than the rest.

Figure S9: Top: The extent to which the two attention masks of the different heads attend to objects rather than the background. Middle: The extent to which the two attention masks of the different heads attend to specific locations in the image. Bottom: Mean absolute value of the weights from the different PrediNet heads to the output MLP.

S2.3 Head pruning

Motivated by the PCA analysis, which suggested that the PrediNet learns a more disentangled representation across heads than the MHA, we ran experiments to see how well the two architectures could perform using only a subset of the heads at test time. The two trained networks chosen (one PrediNet and one MHA) were the same ones used in the PCA analysis, both of which had learnt the colour / shape task to a high degree of accuracy. First, for each subset size, we randomly sampled 100 subsets of heads; for each subset, we zeroed the outputs of the unchosen heads and evaluated the test accuracy of the masked network. In both the mean and best cases, the accuracy of the PrediNet increased much faster than that of the MHA as the number of heads was increased (Figure S10). Then, for the PrediNet, we ran the same experiments but sampled only from the 10 heads that we identified as attending to two objects in the image (which include all those that clustered the labels in the PCA analysis (Section S2.1)). Remarkably, we found that only a handful of heads were needed to attain the accuracy of the full unmasked network, providing further evidence of functionally disentangled representations in the PrediNet.
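Head pruning at test time amounts to zeroing the slices of the concatenated PrediNet output that belong to the unchosen heads, as in the following sketch (the sizes in the example are arbitrary).

```python
import numpy as np

def prune_heads(output, keep, k, head_dim):
    """Zero the outputs of every head not listed in `keep` at test time.

    output   : (batch, k * head_dim) concatenated per-head outputs.
    keep     : iterable of head indices to retain.
    """
    out = output.reshape(-1, k, head_dim).copy()
    mask = np.zeros(k, dtype=bool)
    mask[list(keep)] = True
    out[:, ~mask, :] = 0.0                     # silence the unchosen heads
    return out.reshape(-1, k * head_dim)

# Example with arbitrary sizes: keep 3 of 8 heads.
rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8 * 12))
pruned = prune_heads(batch, keep=rng.choice(8, size=3, replace=False),
                     k=8, head_dim=12)
```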

Figure S10: Accuracy for PrediNet and Multi-head attention on the colour / shape task when random subsets of heads are used at test time. PrediNet* only samples from the heads attending to pairs of objects.

Appendix S3 Experimental Variations

Further details and results are provided in this section. Fig. S11 presents the test accuracy on the 'stripes' object set, for which a summary is presented in Table 1 of the main text. Fig. S12 shows the results of the same experiment but using Adam instead of SGD; the TensorFlow default values for all other Adam parameters were used. While other learning rate values were also tested, the chosen value gave the best overall performance for all architectures. Multi-task experiments were also performed using Adam with the same learning rate (Fig. S14 & S15), yielding overall performance similar to that obtained with SGD.

To assess the extent to which the number of heads and relations plays a role in performance, we ran experiments with a larger number of heads (Fig. S16 & S17), as well as with fewer heads and more relations (Fig. S18 & S19). The results indicate that having a greater number of heads leads to better performance than having a greater number of relations, as more heads provide greater stability during training and a richer propositional representation.

Figure S11: Relations Game learning curves for the different models, trained with SGD using the default hyperparameters (Table S2). The results are summarised in Table 1 of the main manuscript.
Figure S12: Relations Game learning curves for the different models trained with Adam. All other experimental parameters are the same as in Fig. S11.
Figure S13: Multi-task curriculum training. The different columns correspond to different target / pre-training task combinations, while the different rows correspond to the different architectures. SGD was used with the default hyperparameters. Training was performed using the pentominoes object set and testing using the 'stripes' object set. From left to right, the combinations of target / pre-training tasks are: (match rows, 3 row patterns), (5 column patterns, between), (3 column patterns, between), (match rows, 3 column patterns) and (5 row patterns, 5 column patterns). From top to bottom, the different architectures are: MLP1, MLP2, Relation net (RN), Multi-head attention (MHA) and PrediNet.
Figure S14: Multi-task curriculum training. The different columns correspond to different target / pre-training task combinations, while the different rows correspond to the different architectures, as in Fig. S13. Adam was used instead of SGD. Training was performed using the pentominoes object set and testing using the 'stripes' object set.
Figure S15: Reusability of representations learned with a variety of target and pre-training tasks, using the 'stripes' object set. All architectures were trained using Adam. The experimental setup is the same as in Fig. S14.
Figure S16: Multi-task curriculum training. The different columns correspond to different target / pre-training task combinations, while the different rows correspond to the different architectures. SGD was used. Training was performed using the pentominoes object set and testing using the 'stripes' object set. The experimental setup is the same as for Fig. S13, except that the PrediNet has a larger number of heads. Increasing the number of heads for the PrediNet increases the stability during training and overall performance.
Figure S17: Reusability of representations learned with a variety of target and pre-training tasks, using the ‘stripes’ object set. All experimental parameters are as in Fig. S16.
Figure S18: Multi-task curriculum training. The different columns correspond to different target / pre-training task combinations, while the different rows correspond to the different architectures. SGD was used. Training was performed using the pentominoes object set and testing using the 'stripes' object set. The experimental setup is the same as for Fig. S13, except that the PrediNet has fewer heads and more relations. Having fewer heads leads to a decrease in performance, even if the number of relations is increased to maintain the overall network size.
Figure S19: Reusability of representations learned with a variety of target and pre-training tasks, using the ‘stripes’ object set. All experimental parameters are as in Fig. S18.