Logic and the 2-Simplicial Transformer

09/02/2019 · James Clift, et al. · The University of Melbourne

We introduce the 2-simplicial Transformer, an extension of the Transformer which includes a form of higher-dimensional attention generalising the dot-product attention, and uses this attention to update entity representations with tensor products of value vectors. We show that this architecture is a useful inductive bias for logical reasoning in the context of deep reinforcement learning.


1 Introduction

Deep learning has grown to incorporate a range of differentiable algorithms for computing with learned representations. The most successful examples of such representations, those learned by convolutional neural networks, are structured by the scale and translational symmetries of the underlying space (e.g. a two-dimensional Euclidean space for images). It has been suggested that in humans the ability to make rich inferences based on abstract reasoning is rooted in the same neural mechanisms underlying relational reasoning in space [16, 19, 6, 7] and more specifically that abstract reasoning is facilitated by the learning of structural representations which serve to organise other learned representations in the same way that space organises the representations that enable spatial navigation [68, 41]. This raises a natural question: are there any ideas from mathematics that might be useful in designing general inductive biases for learning such structural representations?

As a motivating example we take the recent progress on natural language tasks based on the Transformer architecture [66] which simultaneously learns to represent both entities (typically words) and relations between entities (for instance the relation between “cat” and “he” in the sentence “There was a cat and he liked to sleep”). These representations of relations take the form of query and key vectors governing the passing of messages between entities; messages update entity representations over several rounds of computation until the final representations reflect not just the meaning of words but also their context in a sentence. There is some evidence that the geometry of these final representations serves to organise word representations in a syntax tree, which could be seen as the appropriate analogue to two-dimensional space in the context of language [33].

The Transformer may therefore be viewed as an inductive bias for learning structural representations which are graphs, with entities as vertices and relations as edges. While a graph is a discrete mathematical object, there is a naturally associated topological space which is obtained by gluing 1-simplices (copies of the unit interval) indexed by edges along 0-simplices (points) indexed by vertices. There is a general mathematical notion of a simplicial set, which is a discrete structure containing a set of n-simplices for all n ≥ 0 together with an encoding of the incidence relations between these simplices. Associated to each simplicial set is a topological space, obtained by gluing together vertices, edges, triangles (2-simplices), tetrahedrons (3-simplices), and so on, according to the instructions contained in the simplicial set. Following the aforementioned works in neuroscience [16, 19, 6, 7, 68, 41] and their emphasis on spatial structure, it is natural to ask if a simplicial inductive bias for learning structural representations can facilitate abstract reasoning.

With this motivation, we begin in this paper an investigation of simplicial inductive biases for abstract reasoning in neural networks, by giving a simple method for incorporating 2-simplices (which relate three entities) into the existing Transformer architecture. We call this the 2-simplicial Transformer block. It has been established in recent work [52, 69, 67] that relational inductive biases are useful for solving problems that draw on abstract reasoning in humans. In Section 5 we show that when embedded in a deep reinforcement learning agent our 2-simplicial Transformer block confers an advantage over the ordinary Transformer block in an environment with logical structure, and on this basis we argue that further investigation of simplicial inductive biases is warranted.

What is the 2-simplicial Transformer block? In each iteration of a standard Transformer block a sequence of entity representations $e_1, \dots, e_N$ is first multiplied by weight matrices to obtain query, key and value vectors $q_i, k_i, v_i \in \mathbb{R}^d$ for $1 \le i \le N$. Updated value vectors are then computed according to the rule

$$v'_i = \sum_{j=1}^{N} \operatorname{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right) v_j \qquad (1)$$

These value vectors (possibly concatenated across multiple heads) are then passed through a feedforward network and layer normalisation [3] to compute updated representations $e'_1, \dots, e'_N$, which are the outputs of the Transformer block. In each iteration of a 2-simplicial Transformer block the updated value vector $v'_i$ also depends on higher-order terms

$$v'_i = \sum_{j} \operatorname{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right) v_j \;+\; \sum_{j,k} \operatorname{softmax}_{j,k}\!\left(\frac{\langle p_i, l^1_j, l^2_k \rangle}{\sqrt{d}}\right) B(u_j \otimes u_k) \qquad (2)$$

where $u_1, \dots, u_N$ is a second sequence of value vectors derived from the entity representations via multiplication with weight matrices (as are the vectors $p_i, l^1_i, l^2_i$), $B$ is a weight tensor, and the scalars $\langle p_i, l^1_j, l^2_k \rangle$ are the 2-simplicial attention, viewed as logits which predict the existence of a 2-simplex with vertices $i, j, k$. The scalar triple product $\langle a, b, c \rangle$ can be written explicitly in terms of the pairwise dot products of $a, b, c$ (see Definition 2.5).

If the set of possible 2-simplices is unconstrained the 2-simplicial Transformer block has time complexity $O(N^3)$ as a function of the number of entities $N$, and this is impractical. To reduce the time complexity we introduce a set of virtual entities which are updated by the ordinary attention, and restrict our 2-simplices to have their base pair of vertices $j, k$ be virtual entities; effectively we use the ordinary Transformer to predict which entities should participate in 2-simplices. Taking the number of virtual entities to be of the order of $\sqrt{N}$ gives the 2-simplicial Transformer block the same complexity as the ordinary Transformer. For a full specification of the 2-simplicial Transformer block see Section 2.2.

The architecture of our deep reinforcement learning agent largely follows [69] and the details are given in Section 4. The key difference between our simplicial agent and the relational agent of [69] is that in place of a standard Transformer block we use a 2-simplicial Transformer block. Our use of tensor products of value vectors is inspired by the semantics of linear logic in vector spaces [25, 47, 14], in which an algorithm with multiple inputs computes on the tensor product of those inputs, but this is an old idea in natural language processing, used in models including the second-order RNN [22, 50, 27, 23], multiplicative RNN [62, 36], Neural Tensor Network [60] and the factored 3-way Restricted Boltzmann Machine [51]; see Appendix D. More recently tensors have been used to model predicates in a number of neural network architectures aimed at logical reasoning [55, 18]. The main novelty in our model lies in the introduction of the 2-simplicial attention, which allows these ideas to be incorporated into the Transformer architecture.

What is the environment? The environment in our reinforcement learning problem is a variant of the BoxWorld environment from [69]. The original BoxWorld is played on a rectangular grid populated by keys and locked boxes, with the goal being to open the box containing the Gem (represented by a white square) as shown in the sample episode of Figure 1. Locked boxes are drawn as two consecutive pixels with the lock on the right and the contents of the box (another key) on the left. Each key can only be used once. In the episode shown there is a loose pink key which can be used to open one of two locked boxes, obtaining in this way one of two further keys (the agent sees only the colours of tiles, not the numbers, which are added to the figure for exposition). The correct choice is the box that leads via the sequence of keys to the Gem. All other possibilities (referred to as distractor boxes) lead to a board configuration in which the player is unable to obtain the Gem; for further details, including the shaping of rewards, see Section 3.1.

Our variant of the BoxWorld environment, bridge BoxWorld, is shown in Figure 2. In each episode two keys are now required to obtain the Gem, and there are therefore two loose keys on the board. Beginning at each loose key is a solution path leading to one of the keys required to open the box containing the Gem, and the eponymous bridges allow the player to cross between solution paths, thereby rendering the puzzle unsolvable. For instance, in Figure 2 opening the box marked “bridge” uses up the orange key, so it is only possible to obtain one of the two keys needed to open the box containing the Gem. To reach the Gem, the player must therefore learn to recognise and avoid these bridges. For further details of the bridge BoxWorld environment see Section 3.2.

Figure 1: Right: a sample episode of the BoxWorld environment. Light gray tiles represent the floor, the dark gray tile is the player, the white tile is the Gem, and the rightmost column is the player inventory, currently empty. Left: graph representation of the puzzle, with key colours as vertices and an arrow from one colour to another if a key of the first colour can be used to obtain a key of the second.
Figure 2: Right: a sample episode of the bridge BoxWorld environment, in which the Gem has two locks and there is a marked bridge. Left: graph representation of the puzzle, with upper and lower solution paths and the bridge between them.

What is the logical structure of the environment? The design of the BoxWorld environment was intended to stress the planning and reasoning components of an agent’s policy [69, p.2] and for this reason it is the underlying logical structure of the environment (rather than its representation in terms of coloured keys) that is of central importance. To explain this logical structure we introduce the following notation: given a colour $x$, we use $x$ to stand for the proposition that a key of this colour is obtainable.

Each episode expresses its own set of basic facts, or axioms, about obtainability. For instance, a loose key of colour $x$ gives $x$ as an axiom, and a locked box requiring a key of colour $x$ in order to obtain a key of colour $y$ gives an axiom that at first glance appears to be the implication $x \Rightarrow y$ of classical logic. However, since a key may only be used once, this is actually incorrect; instead the logical structure of this situation is captured by the linear implication $x \multimap y$ of linear logic [25]. With this understood, each episode of the original BoxWorld provides in visual form a set of axioms such that a strategy for obtaining the Gem is equivalent to a proof of $g$ from those axioms in intuitionistic linear logic, where $g$ stands for the proposition that the Gem is obtainable. There is a general correspondence in logic between strategies and proofs, which we recall in Appendix B.

To describe the logical structure of bridge BoxWorld we need to encode the fact that two keys (say a green key and a blue key) are required to obtain the Gem. It is the linear conjunction $\otimes$ of linear logic (also called the tensor product) rather than the conjunction $\wedge$ of classical logic that properly captures the semantics. The axioms encoded in an episode of bridge BoxWorld contain a single formula of the form $x \otimes y \multimap g$ where $x, y$ are the colours of the keys on the Gem, and again a strategy is equivalent to a proof of $g$. In conclusion, the logical structure of the original BoxWorld consists of a fragment of linear logic containing only the connective $\multimap$, while bridge BoxWorld captures a slightly larger fragment containing $\multimap$ and $\otimes$. The problem faced by the agent is to learn, purely through interaction, this underlying logical structure.
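As an illustration (the colours and letters here are ours, not those of a particular episode), consider a hypothetical bridge BoxWorld episode with loose keys $p$ and $q$, a box trading $p$ for $o$, a box trading $q$ for $b$, and a Gem requiring both $o$ and $b$. In LaTeX notation the axioms and the sequent to be proved are:

% Hypothetical episode (illustrative only): the axioms are
%   p,  q,  p -o o,  q -o b,  (o (x) b) -o g
% and a winning strategy corresponds to a proof of g:
\[
  p,\quad q,\quad p \multimap o,\quad q \multimap b,\quad (o \otimes b) \multimap g
  \;\vdash\; g
\]
% Proof sketch: two cuts (modus ponens for -o) yield o and b, the
% (x)-introduction rule forms o (x) b, and a final cut against
% (o (x) b) -o g yields g.

A winning strategy corresponds to a proof of $g$ in which each axiom is used exactly once, reflecting the fact that each key can only be used once.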

Acknowledgements. We acknowledge support from the Nectar cloud at the University of Melbourne and GCP research credits.

2 2-Simplicial Transformer

The Transformer architecture introduced in [66] builds on a history of attention mechanisms that in the context of natural language processing go back to [4]. For general surveys of soft attention mechanisms in deep learning see [13, 49] and [26, §12.4.5.1]. The fundamental idea, of propagating information between nodes using weights that depend on the dot product of vectors associated to those nodes, comes ultimately from statistical mechanics via the Hopfield network, see Remark 2.2. We distinguish between the Transformer architecture which contains a word embedding layer, an encoder and a decoder, and the Transformer block which is the sub-model of the encoder that is repeated.

In this section we first review the definition of the ordinary Transformer block (Section 2.1) and then explain the 2-simplicial Transformer block (Section 2.2). Both blocks define operators on sequences of entity representations. Strictly speaking the entities are indices $1 \le i \le N$ but we sometimes identify the entity $i$ with its representation $e_i$. The space of entity representations is denoted $H$, while the space of query, key and value vectors is denoted $V$. We use only the vector space structure on $H$, but $V$ is an inner product space with the usual dot product pairing, and in defining the 2-simplicial Transformer block we will use additional algebraic structure on $V$, including the “multiplication” tensor $B$ of (15) (used to propagate tensor products of value vectors) and the Clifford algebra of $V$ (used to define the 2-simplicial attention).

2.1 Transformer block

In the first step of the standard Transformer block we generate from each entity $e_i$ a tuple of vectors via a learned linear transformation $E : H \to V^{\oplus 3}$. These vectors are referred to respectively as query, key and value vectors and we write

$$E(e_i) = (q_i, k_i, v_i). \qquad (3)$$

Stated differently, $q_i = W_Q e_i$, $k_i = W_K e_i$ and $v_i = W_V e_i$ for weight matrices $W_Q, W_K, W_V$. In the second step we compute a refined value vector for each entity

$$v'_i = \sum_{j=1}^{N} \operatorname{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{\dim V}}\right) v_j. \qquad (4)$$

Finally, the new entity representation is computed by the application of a feedforward network $g$, layer normalisation [3] and a skip connection

$$e'_i = \mathrm{LayerNorm}\big(g(v'_i) + e_i\big). \qquad (5)$$

We refer to this form of attention as 1-simplicial attention.
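The following is a minimal NumPy sketch of the 1-simplicial attention of equations (3)–(4), written to illustrate the indexing rather than to reproduce our agent; the shapes, random weights, and the omission of the feedforward network and layer normalisation of (5) are assumptions of the sketch.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_simplicial_attention(E, Wq, Wk, Wv):
    """E: (N, d_H) entity representations; Wq, Wk, Wv: (d_H, d_V) weights.
    Returns the refined value vectors v' of equation (4), shape (N, d_V)."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv           # queries, keys, values as in (3)
    logits = Q @ K.T / np.sqrt(K.shape[-1])    # q_i . k_j / sqrt(dim V)
    A = softmax(logits, axis=-1)               # attention distribution over j
    return A @ V                               # v'_i = sum_j A_ij v_j, as in (4)

# Toy usage with random weights (illustration only)
rng = np.random.default_rng(0)
N, d_H, d_V = 5, 8, 4
E = rng.normal(size=(N, d_H))
Wq, Wk, Wv = (rng.normal(size=(d_H, d_V)) for _ in range(3))
v_prime = one_simplicial_attention(E, Wq, Wk, Wv)   # shape (5, 4)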

Remark 2.1.

At the beginning of training the query and key vectors are random vectors in $V$, which are approximately orthogonal in expectation if $\dim V$ is sufficiently large. We therefore expect that without any training $v'_i \approx \frac{1}{N}\sum_j v_j$. Suppose that for each entity $i$ there is a single entity $j_i$ with information useful for the training objective. Then as learning progresses the random configuration will vary so that the query $q_i$ for entity $i$ lies in the same direction as the key $k_{j_i}$ of $j_i$ and is orthogonal to $k_j$ for $j \neq j_i$. Then the logit $q_i \cdot k_{j_i}$ will dominate the softmax distribution and $v'_i \approx v_{j_i}$.

Remark 2.2.

The continuous Hopfield network [34] [43, Ch.42] with $N$ nodes updates in each timestep a sequence of vectors $e_1, \dots, e_N$ by the rule

$$e_i \longmapsto \tanh\!\Big(\beta \sum_{j=1}^{N} (e_i \cdot e_j)\, e_j\Big) \qquad (6)$$

for some parameter $\beta$. The Transformer block may therefore be viewed as a refinement of the Hopfield network, in which the three occurrences of entity vectors in (6) are replaced by query, key and value vectors respectively, the nonlinearity is replaced by a feedforward network with multiple layers, and the dynamics are stabilised by layer normalisation. The initial representations also incorporate information about the underlying lattice, via the positional embeddings.

Remark 2.3.

In multiple-head attention with $h$ heads, there are $h$ channels along which to propagate information between every pair of entities, each of dimension $\dim(V)/h$. More precisely, we choose a decomposition $V = \bigoplus_{a=1}^{h} V_a$ so that $\dim V_a = \dim(V)/h$ and write

$$q_i = \bigoplus_{a=1}^{h} q_i^a, \qquad k_i = \bigoplus_{a=1}^{h} k_i^a, \qquad v_i = \bigoplus_{a=1}^{h} v_i^a.$$

To compute the output of the attention, we take a direct sum of the value vectors propagated along every one of these channels, as in the formula

$$v'_i = \bigoplus_{a=1}^{h} \sum_{j=1}^{N} \operatorname{softmax}_j\!\left(\frac{q_i^a \cdot k_j^a}{\sqrt{\dim V_a}}\right) v_j^a. \qquad (7)$$
Remark 2.4.

In the introduction we referred to the idea that a Transformer model learns representations of relations. To be more precise, these representations are heads, each of which determines an independent set of transformations extracting queries, keys and values from entities. Thus a head determines not only which entities are related (via the query and key vectors) but also what information to transmit between them (via the value vectors).

The idea that the structure of a sentence acts to transform the meaning of its parts is due to Frege [21] and underlies the denotational semantics of logic. From this point of view the Transformer architecture is an inheritor both of the logical tradition of denotational semantics, and of the statistical mechanics tradition via Hopfield networks.

2.2 2-Simplicial Transformer block

In combinatorial topology the canonical one-dimensional object is the 1-simplex (or edge) and the canonical two-dimensional object is the 2-simplex (or triangle), which we may represent diagrammatically in terms of indices as

(8) (diagram of a 1-simplex with vertices $j, i$ and a 2-simplex with vertices $i, j, k$)

It is natural to apply the theory of simplicial sets and simplicial complexes, which are the mathematical objects which organise collections of n-simplices, in the context of learning representations in computer vision and machine learning, and indeed this has been done [11, 39, 8]. To the extent that our approach differs from these earlier works, the difference lies in the fact that our simplices arise from the geometric algebra of configurations of query and key vectors, which seems to be an emerging idiom within deep learning.

In the 2-simplicial Transformer block, in addition to the 1-simplicial contribution, each entity $e_i$ is updated as a function of pairs of entities $e_j, e_k$ using the tensor product of value vectors $u_j \otimes u_k$ and a probability distribution derived from a scalar triple product $\langle p_i, l^1_j, l^2_k \rangle$ in place of the scalar product $q_i \cdot k_j$. This means that we associate to each entity $e_i$ a four-tuple of vectors via a learned linear transformation $E' : H \to V^{\oplus 4}$, denoted

$$E'(e_i) = (p_i, l^1_i, l^2_i, u_i). \qquad (9)$$

We still refer to $p_i$ as the query, $l^1_i, l^2_i$ as the keys and $u_i$ as the value. Stated differently, $p_i = W_P e_i$, $l^1_i = W_{L^1} e_i$, $l^2_i = W_{L^2} e_i$ and $u_i = W_U e_i$ for weight matrices $W_P, W_{L^1}, W_{L^2}, W_U$.

Definition 2.5.

The unsigned scalar triple product of $a, b, c \in V$ is

$$\langle a, b, c \rangle = \big\| \langle a\, b\, c \rangle_1 \big\|, \qquad (10)$$

the magnitude of the degree one component of the product $abc$ in the Clifford algebra of $V$ (see Appendix A), whose square is a polynomial in the pairwise dot products

$$\langle a, b, c \rangle^2 = (a \cdot b)^2\,|c|^2 + (a \cdot c)^2\,|b|^2 + (b \cdot c)^2\,|a|^2 - 2\,(a \cdot b)(a \cdot c)(b \cdot c). \qquad (11)$$

This scalar triple product has a simple geometric interpretation in terms of the volume of the tetrahedron with vertices $0, a, b, c$. To explain, recall that the triangle spanned by two unit vectors $u, v$ has an area $A$ given by the following formula:

$$A = \tfrac{1}{2}\sqrt{1 - (u \cdot v)^2}. \qquad (12)$$

In three dimensions, the analogous formula involves the volume $P$ of the tetrahedron with vertices given by unit vectors $u, v, w$ and the scalar triple product:

$$P = \tfrac{1}{3!}\sqrt{1 - \langle u, v, w \rangle^2}. \qquad (13)$$

In general, given nonzero vectors $a, b, c$ let $\hat{a}, \hat{b}, \hat{c}$ denote unit vectors in the same directions. We can by Lemma A.10(v) factor out the lengths in the scalar triple product

$$\langle a, b, c \rangle = |a|\,|b|\,|c|\,\langle \hat{a}, \hat{b}, \hat{c} \rangle \qquad (14)$$

so that a general scalar triple product can be understood in terms of the vector norms and configurations of three points on the unit sphere. One standard approach to calculating volumes of such tetrahedrons is the cross product, which is only defined in three dimensions. Since the space of representations is high dimensional, the natural framework for the triple scalar product is instead the Clifford algebra of $V$ (see Appendix A).
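As an illustration, here is a short NumPy sketch of the unsigned scalar triple product computed from pairwise dot products as in (11) (our reconstruction of that formula); the helper function and sanity checks are ours.

import numpy as np

def triple_product(a, b, c):
    """Unsigned scalar triple product <a, b, c>, via the pairwise-dot-product
    polynomial of (11).  Zero for pairwise orthogonal vectors; equal to
    |a||b||c| when a, b, c are linearly dependent."""
    ab, ac, bc = a @ b, a @ c, b @ c
    sq = ab**2 * (c @ c) + ac**2 * (b @ b) + bc**2 * (a @ a) - 2.0 * ab * ac * bc
    return np.sqrt(max(sq, 0.0))

# Sanity checks in R^4
e1, e2, e3 = np.eye(4)[0], np.eye(4)[1], np.eye(4)[2]
assert np.isclose(triple_product(e1, e2, e3), 0.0)         # pairwise orthogonal
assert np.isclose(triple_product(e1, e1, e1), 1.0)         # linearly dependent
assert np.isclose(triple_product(2*e1, 3*e1, e2), 6.0)     # dependent: |a||b||c|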

For present purposes, we need to know that $\langle a, b, c \rangle$ attains its minimum value (which is zero) when $a, b, c$ are pairwise orthogonal, and attains its maximum value (which is $|a|\,|b|\,|c|$) if and only if $\{a, b, c\}$ is linearly dependent (Lemma A.10). Using the number $\langle p_i, l^1_j, l^2_k \rangle$ as a measure of the degree to which entity $i$ is attending to the pair $(j, k)$, or put differently, the degree to which the network predicts the existence of a 2-simplex $(i, j, k)$, the update rule for the entities when using purely 2-simplicial attention is

$$v'_i = \sum_{j,k} \operatorname{softmax}_{j,k}\!\left(\frac{\langle p_i, l^1_j, l^2_k \rangle}{\sqrt{\dim V}}\right) B(u_j \otimes u_k) \qquad (15)$$

where $B : V \otimes V \to V$ is a learned linear transformation. Although we do not impose any further constraints, the motivation here is to equip $V$ with the structure of an algebra; in this respect we model conjunction by multiplication, an idea going back to Boole [9].

We compute multiple-head 2-simplicial attention in the same way as in the 1-simplicial case. To combine 1-simplicial heads (that is, ordinary Transformer heads) and 2-simplicial heads we use separate inner product spaces $V_1, V_2$ for each simplicial dimension, so that there are learned linear transformations $E : H \to V_1^{\oplus 3}$ and $E' : H \to V_2^{\oplus 4}$ and the queries, keys and values are extracted from an entity $e_i$ according to

$$E(e_i) = (q_i, k_i, v_i), \qquad E'(e_i) = (p_i, l^1_i, l^2_i, u_i).$$

The update rule (for a single head in each simplicial dimension) is then:

$$v'_i = \sum_{j} \operatorname{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{\dim V_1}}\right) v_j \;\oplus\; \mathrm{LayerNorm}\!\left[\sum_{j,k} \operatorname{softmax}_{j,k}\!\left(\frac{\langle p_i, l^1_j, l^2_k \rangle}{\sqrt{\dim V_2}}\right) B(u_j \otimes u_k)\right] \qquad (16)$$
$$e'_i = \mathrm{LayerNorm}\big(g(v'_i) + e_i\big) \qquad (17)$$

Regarding the layer normalisation on the output of the 2-simplicial head see Remark 4.1. If there are $h_1$ heads of 1-simplicial attention and $h_2$ heads of 2-simplicial attention, then (16) is modified in the obvious way using direct sums over the $h_1$ 1-simplicial heads and the $h_2$ 2-simplicial heads.

The time complexity of 1-simplicial attention as a function of the number of entities $N$ is $O(N^2)$, while the time complexity of 2-simplicial attention is $O(N^3)$ since we have to calculate the attention for every triple $(i, j, k)$ of entities. For this reason we consider only triples where the base $(j, k)$ of the 2-simplex is taken from a set of pairs predicted by the ordinary attention, which we view as the primary locus of computation. More precisely, we introduce in addition to the entities $e_1, \dots, e_N$ (now referred to as standard entities) a set of $M$ virtual entities $e_{N+1}, \dots, e_{N+M}$. These virtual entities serve as a “scratch pad” onto which the iterated ordinary attention can write representations, and we restrict $j, k$ to lie in the range $N < j, k \le N + M$ so that only value vectors obtained from virtual entities are propagated by the 2-simplicial attention.

With virtual entities the update rule for a standard entity $1 \le i \le N$ is

$$v'_i = \sum_{j=1}^{N} \operatorname{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{\dim V_1}}\right) v_j \;\oplus\; \mathrm{LayerNorm}\!\left[\sum_{j,k=N+1}^{N+M} \operatorname{softmax}_{j,k}\!\left(\frac{\langle p_i, l^1_j, l^2_k \rangle}{\sqrt{\dim V_2}}\right) B(u_j \otimes u_k)\right] \qquad (18)$$

and for a virtual entity $N < i \le N + M$

$$v'_i = \sum_{j=1}^{N+M} \operatorname{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{\dim V_1}}\right) v_j \;\oplus\; u_i. \qquad (19)$$

The updated representation $e'_i$ is computed from $v'_i$ using (17) as before. Observe that the virtual entities are not used to update the standard entities during 1-simplicial attention, and the 2-simplicial attention is not used to update the virtual entities; instead the second summand in (19) involves the vector $u_i$, which adds recurrence to the update of the virtual entities. After the attention phase the virtual entities are discarded.

The method for updating the virtual entities is similar to the role of the memory nodes in the relational recurrent architecture of [53], the master node in [24, §5.2] and the memory slots in the Neural Turing Machine [28]. The update rule has complexity $O(N M^2)$ and so if we take $M$ to be of order $\sqrt{N}$ we get the desired overall complexity $O(N^2)$.
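The following NumPy sketch illustrates the restricted 2-simplicial attention just described: triple-product logits for each standard entity over pairs of virtual entities, a softmax over the $M^2$ pairs, and propagation of $B(u_j \otimes u_k)$, at cost $O(N M^2)$. All shapes, weights and the loop-based implementation are illustrative assumptions, not the configuration or code of our agents.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def triple_sq(p, l1, l2):
    # squared unsigned scalar triple product via pairwise dot products, as in (11)
    ab, ac, bc = p @ l1, p @ l2, l1 @ l2
    return ab**2 * (l2 @ l2) + ac**2 * (l1 @ l1) + bc**2 * (p @ p) - 2*ab*ac*bc

def two_simplicial_attention(E_std, E_virt, Wp, Wl1, Wl2, Wu, B):
    """E_std: (N, d_H) standard entities, E_virt: (M, d_H) virtual entities.
    Wp, Wl1, Wl2, Wu: (d_H, d_V) weights; B: (d_V, d_V, d_V) tensor so that
    B(u (x) u') = einsum('abc,b,c->a', B, u, u').  Cost is O(N * M^2)."""
    P  = E_std  @ Wp                       # queries p_i of standard entities
    L1 = E_virt @ Wl1                      # first keys l^1_j of virtual entities
    L2 = E_virt @ Wl2                      # second keys l^2_k of virtual entities
    U  = E_virt @ Wu                       # values u_j of virtual entities
    N, M, d_V = P.shape[0], U.shape[0], U.shape[1]

    logits = np.zeros((N, M, M))           # logits[i, j, k] = <p_i, l^1_j, l^2_k>
    for i in range(N):
        for j in range(M):
            for k in range(M):
                logits[i, j, k] = np.sqrt(max(triple_sq(P[i], L1[j], L2[k]), 0.0))
    logits /= np.sqrt(d_V)

    A = softmax(logits.reshape(N, M * M), axis=-1).reshape(N, M, M)
    pair_values = np.einsum('abc,jb,kc->jka', B, U, U)   # B(u_j (x) u_k), (M, M, d_V)
    return np.einsum('ijk,jka->ia', A, pair_values)      # one output row per standard entity

# Toy usage (illustration only)
rng = np.random.default_rng(1)
N, M, d_H, d_V = 6, 2, 8, 4
out = two_simplicial_attention(
    rng.normal(size=(N, d_H)), rng.normal(size=(M, d_H)),
    *(rng.normal(size=(d_H, d_V)) for _ in range(4)),
    rng.normal(size=(d_V, d_V, d_V)))
print(out.shape)   # (6, 4)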

Remark 2.6.

In the dot product attention the norms of the queries and keys affect the distribution

$$\operatorname{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{\dim V}}\right) = \operatorname{softmax}_j\!\left(\frac{|q_i|\,|k_j|\cos\theta_{ij}}{\sqrt{\dim V}}\right) \qquad (20)$$

in different ways. Since $q_i$ is fixed as $j$ varies, the query norm $|q_i|$ acts like the inverse temperature in the Boltzmann distribution: increasing the query norm decreases the entropy of the distribution. On the other hand, if $|k_j|$ is large then, all else being equal, any entity which attends to $j$ will do so strongly, and in this sense the key norm is a measure of the importance of entity $j$. Using (14) a similar interpretation applies to the 2-simplicial attention.
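A small numerical illustration of this remark (the vectors and scales are arbitrary): scaling a query scales all of its logits, lowering the entropy of its attention distribution, while scaling a key amplifies the logit $q \cdot k_j$, so that entity $j$ is either strongly attended to or strongly ignored depending on the sign of that logit.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(2)
q = rng.normal(size=8)
K = rng.normal(size=(5, 8))            # five keys

for scale in (0.5, 1.0, 2.0, 4.0):
    p = softmax((scale * q) @ K.T)     # larger |q| acts like a lower temperature
    print(f"|q| scale {scale}: entropy {entropy(p):.3f}")   # entropy decreases

K2 = K.copy()
K2[3] *= 3.0                           # a large key norm amplifies the logit q . k_3
print(softmax(q @ K.T).round(2))
print(softmax(q @ K2.T).round(2))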

3 Environment

Our environment is an extension of the BoxWorld environment of [69], see also [53, 30], implemented as an OpenAI gym environment [10]. We begin by explaining our implementation of the original BoxWorld, and then we explain the extension used in our experiments. The code for our implementation of the original BoxWorld environment, and of bridge BoxWorld, is available online [15].

3.1 Standard BoxWorld

The standard BoxWorld environment is a rectangular grid in which are situated the player (a dark gray tile) and a number of locked boxes, each represented by a pair of horizontally adjacent tiles with a tile of the key colour on the left and a tile of the lock colour on the right. There is also one loose key in each episode, which is a coloured tile not initially adjacent to any other coloured tile. All other tiles are blank (light gray) and are traversable by the player. The rightmost column of the screen is the inventory, which fills from the top and contains keys that have been collected by the player. The player can pick up any loose key by walking over it. In order to open a locked box, the player must step on the lock while in possession of a key of the lock colour, in which case one copy of this key is removed from the inventory and replaced by a key of the key colour. The goal is to attain a white key, referred to as the Gem.

Some locked boxes, if opened, provide keys that are not useful for attaining the Gem. Since each key may only be used once, opening such boxes renders the episode unsolvable. Such boxes are called distractors. An episode ends when the player either obtains the Gem (earning a positive reward) or opens a distractor box (earning a negative reward). Opening any non-distractor box, or picking up a loose key, garners a small positive reward. The solution length is the number of locked boxes (including the one containing the Gem) on the path from the loose key to the Gem. The episode in Figure 1 has solution length four. Episodes are parametrised by the solution length and the number of distractors.

3.2 Bridge BoxWorld

In our extension of the original BoxWorld environment, the box containing the Gem has two locks (of different colours). To obtain the Gem, the player must step on either of the lock tiles with both keys in the inventory, at which point the episode ends with the usual reward. Graphically, Gems with multiple locks are denoted with two vertical white tiles on the left, and the two lock tiles on the right; see Figure 2.

Two solution paths (of the same length) leading to each of the locks on the Gem are generated with no overlapping colours, beginning with two loose keys. In episodes with multiple locks we do not consider distractor boxes of the old kind; instead there is a new type of distractor that we call a bridge. This is a locked box whose lock colour is taken from one solution branch and whose key colour is taken from the other branch. Opening the bridge renders the puzzle unsolvable. An episode ends when the player either obtains the Gem (positive reward) or opens a bridge (negative reward). Opening a box other than the bridge, or picking up a loose key, is rewarded as before. In this paper we consider episodes with zero or one bridge (the player cannot fail to solve an episode with no bridge).

4 Agent

Our baseline relational agent is modelled closely on [69], except that we found a different arrangement of layer normalisations worked better in our experiments; see Remark 4.2. The code for our implementation of both agents is available online [15].

4.1 Basic architecture

In the following we describe the network architecture of both the relational and simplicial agents; we will note the differences between the two models as they arise. The input to the agent’s network is an RGB image, represented as a tensor of shape $(R, C+1, 3)$ where $R$ is the number of rows and $C$ the number of columns of the board (the extra column is due to the inventory). This tensor is normalised and then passed through two convolutional layers, both with ReLU activations and “valid” padding, so that the spatial dimensions of the output are smaller than those of the input. We then multiply by a weight matrix to obtain a tensor of feature vectors, each of which has concatenated to it a two-dimensional positional encoding, and the result is reshaped into a tensor of shape $N \times d$ where $N$ is the number of Transformer entities and $d$ is the dimension of the entity representations. This is the list of entity representations $e_1, \dots, e_N$.

In the case of the simplicial agent, a further two learned embedding vectors are added to this list; these are the virtual entities. So, writing $N'$ for the total number of entities ($N$ in the case of the relational agent and $N + 2$ for the simplicial agent), the entity representations form a tensor of shape $N' \times d$. This tensor is then passed through two iterations of the Transformer block (either purely 1-simplicial in the case of the relational agent, or including both 1-simplicial and 2-simplicial attention in the case of the simplicial agent). In the case of the simplicial agent the virtual entities are then discarded, so that in both cases we have a sequence of $N$ entities.

To this final entity tensor we apply max-pooling over the entity dimension, that is, we compute a vector $m$ with components $m_a = \max_{1 \le i \le N} (e_i)_a$. This vector is then passed through four fully-connected layers with ReLU activations. The output of the final fully-connected layer is multiplied by one weight matrix to produce logits for the actions (left, up, right and down) and by another weight matrix to produce the value function.
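A NumPy sketch of the head of the network just described, from the entity tensor to policy logits and value; the dimensions, the identity stand-in for the Transformer block and the weight initialisation are placeholder assumptions, not our trained configuration.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def agent_head(entities, fc_weights, W_policy, W_value, transformer_block, n_blocks=2):
    """entities: (N', d) entity representations (standard entities, plus virtual
    entities for the simplicial agent).  fc_weights: list of weight matrices for
    the fully-connected stack.  transformer_block: callable (N', d) -> (N', d)."""
    e = entities
    for _ in range(n_blocks):             # two iterations of the Transformer block
        e = transformer_block(e)
    pooled = e.max(axis=0)                # max-pooling over the entity dimension
    h = pooled
    for W in fc_weights:                  # four FC layers with ReLU activations
        h = relu(h @ W)
    logits = h @ W_policy                 # action logits: left, up, right, down
    value = (h @ W_value).item()          # scalar value function
    return logits, value

# Toy usage with an identity "Transformer block" (illustration only)
rng = np.random.default_rng(3)
N, d = 30, 64
ents = rng.normal(size=(N, d))
fcs = [rng.normal(size=(d, d)) * 0.05 for _ in range(4)]
logits, value = agent_head(ents, fcs, rng.normal(size=(d, 4)) * 0.05,
                           rng.normal(size=(d, 1)) * 0.05, lambda e: e)
print(logits.shape, value)   # (4,) and a float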

4.2 Transformer blocks

The inputs to our Transformer blocks are tensors of shape $N' \times d$, and the outputs have the same shape. Inside each block are two feedforward layers separated by a ReLU activation; the weights are shared between iterations of the Transformer block. The pseudo-code for the ordinary Transformer block inside the relational agent is:

def transformer_block(e):
    x = LayerNorm(e)
    a = 1SimplicialAttention(x)
    b = DenseLayer1(a)
    c = DenseLayer2(b)
    r = Add([e,c])
    eprime = LayerNorm(r)
    return eprime

In the 2-simplicial Transformer block the input tensor, after layer normalisation, is passed through the 2-simplicial attention and the result (after an additional layer normalisation) is concatenated to the output of the 1-simplicial attention heads before being passed through the feedforward layers:

def simplicial_transformer_block(e):
    x = LayerNorm(e)
    a1 = 1SimplicialAttention(x)
    a2 = 2SimplicialAttention(x)
    a2n = LayerNorm(a2)
    ac = Concatenate([a1,a2n])
    b = DenseLayer1(ac)
    c = DenseLayer2(b)
    r = Add([e,c])
    eprime = LayerNorm(r)
    return eprime

Our implementation of the standard Transformer block is based on an implementation in Keras from [46]. In the reported experiments we use only two Transformer blocks; we performed two trials of a relational agent using four Transformer blocks, but neither trial exceeded the plateau in terms of fraction of puzzles solved. In both the relational and simplicial agent, the space $H$ of entity representations has a fixed dimension and we denote by $V_1, V_2$ the spaces of 1-simplicial and 2-simplicial queries, keys and values. In both the relational and simplicial agent there are two heads of 1-simplicial attention; in the simplicial agent there is in addition a single head of 2-simplicial attention and two virtual entities.

Remark 4.1.

Without the additional layer normalisation on the output of the 2-simplicial attention we find that training is unstable. The natural explanation is that these outputs are constructed from polynomials of higher degree than the 1-simplicial attention, and thus computational paths that go through the 2-simplicial attention will be more vulnerable to exploding or vanishing gradients. Ignoring denominators in the softmax and layer normalisations, the contribution to $e'_i$ from the 1-simplicial head in (17),(18) is

$$\sum_j e^{\,q_i \cdot k_j}\, v_j \qquad (21)$$

while the contribution from the 2-simplicial head is

$$\sum_{j,k} e^{\,\langle p_i,\, l^1_j,\, l^2_k \rangle}\, B(u_j \otimes u_k). \qquad (22)$$

The lowest order term in (21) is $v_j$, which is linear in the components of the entity vector $e_j$, whereas the lowest order term of (22) is $B(u_j \otimes u_k)$, which is quadratic.

Remark 4.2.

There is wide variation in the use of layer normalisation in the literature on Transformer models, compare [66, 12, 69]. The architecture described in [69] involves layer normalisation in two places: on the concatenation of the query, key and value matrices, and on the output of the feedforward network $g$. We keep this second normalisation but move the first from after the linear transformation of (3) to before this linear transformation, so that it is applied directly to the incoming entity representations.

We found that this works well, but the arrangement is strange in that there are two consecutive layer normalisations between iterations of the Transformer block. Note that if two layer normalisations with bias–gain pairs $(b_1, g_1)$ and $(b_2, g_2)$ are applied in succession (treating the gains and biases as scalars), then the output of the first normalisation layer on input $x$ with mean $\mu$ and variance $\sigma^2$ will be $g_1 \frac{x - \mu}{\sigma} + b_1$, and passing this through the second normalisation layer yields the output $\frac{g_1 g_2}{|g_1|} \cdot \frac{x - \mu}{\sigma} + b_2$, which is equivalent to a single layer normalisation with the pair $(b_2, \frac{g_1 g_2}{|g_1|})$. It is possible that the appearance of the parameter $g_1$ in a nonlinear way provides a useful reparametrisation of the network [26, §8.7.1].
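A small numerical check of this composition, under the simplifying assumption (made in the calculation above) that the gains and biases are scalars:

import numpy as np

def layer_norm(x, gain, bias):
    mu, sigma = x.mean(), x.std()
    return gain * (x - mu) / sigma + bias

rng = np.random.default_rng(4)
x = rng.normal(size=16)
g1, b1 = 2.5, -0.3
g2, b2 = 0.7, 1.1

composed = layer_norm(layer_norm(x, g1, b1), g2, b2)
single   = layer_norm(x, g1 * g2 / abs(g1), b2)    # the single pair (b2, g1*g2/|g1|)
print(np.allclose(composed, single))               # True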

5 Experiments

The training of our agents uses the implementation in Ray RLlib [40] of the distributed off-policy actor-critic architecture IMPALA of [20] with optimisation algorithm RMSProp. The hyperparameters for IMPALA and RMSProp are given in Table 1. Following [69] and other recent work in deep reinforcement learning we use RMSProp with a large value of the hyperparameter $\epsilon$, which is a priori quite strange. However, as we explain in Appendix E, this is effectively a variant of RMSProp with smoothed gradient clipping.

Hyperparameter Value
IMPALA entropy
Discount factor 0.99
Unroll length 40 timesteps
Batch size 1280 timesteps
Learning rate
RMSProp momentum 0
RMSProp ε 0.1
RMSProp decay 0.99

Table 1: Hyperparameters for agent training.
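One way to see the “smoothed gradient clipping” effect of a large $\epsilon$ (a simplification assuming the running average of squared gradients has converged to $g^2$, and using the $\sqrt{v} + \epsilon$ form of the denominator): the per-coordinate step is $\mathrm{lr}\cdot|g|/(|g| + \epsilon)$, which behaves like $(\mathrm{lr}/\epsilon)\,|g|$ for small gradients and saturates near $\mathrm{lr}$ for large ones.

import numpy as np

lr, eps = 0.1, 0.1
for g in (1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0):
    v = g * g                          # assume the running average equals g^2
    step = lr * g / (np.sqrt(v) + eps)
    print(f"|g| = {g}:  step = {step:.4f}  (cap ~ lr = {lr})")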

First we verified that our implementation of the relational agent can solve the standard BoxWorld environment [69], with solution lengths and numbers of distractors sampled from fixed ranges. After training, our implementation solved the vast majority of puzzles (regarding the discrepancy with the reported sample complexity in [69] see Remark 5.1). Next we trained the relational and simplicial agents on bridge BoxWorld, under the following conditions: half of the episodes contain a bridge, the solution length is uniformly sampled from a fixed range (both solution paths are of the same length), colours are uniformly sampled from a fixed set of colours (of fixed saturation and brightness, with varying hue), and the boxes and loose keys are arranged randomly on the grid, under the constraint that the box containing the Gem does not occur in the rightmost column or bottom row and that keys appear only in a restricted set of positions. The starting and ending point of the bridge are uniformly sampled with no restrictions (e.g. the bridge can involve the colours of the loose keys and locks on the Gem) but the lock colour is always on the top solution path. There is no curriculum and no cap on timesteps per episode. We trained four independent trials of both agents, each run either to a fixed timestep budget or to convergence, whichever came first. The training runs for the relational and simplicial agents are shown in Figure 4 and Figure 5 respectively. In Figure 3 we give the mean and standard deviation of these four trials of each agent, showing a clear advantage of the simplicial agent. We make some remarks about performance comparisons, taking into account the fact that the relational agent is simpler (and hence faster to execute) than the simplicial agent, in Appendix C.

Figure 3: Mean training curves of the relational and simplicial agents on bridge BoxWorld. Shown are the mean and standard deviation of four runs of each agent.
Figure 4: Training curves for the relational agent on bridge BoxWorld.
Figure 5: Training curves for the simplicial agent on bridge BoxWorld.
Remark 5.1.

The experiments in the original BoxWorld paper [69] contain an unreported cap on timesteps per episode (an episode horizon) [54]. We have chosen to run our experiments without an episode horizon, and since this means our reported sample complexities diverge substantially from the original paper (some part of which it seems reasonable to attribute to the lack of horizon) it is necessary to justify this choice.

When designing an architecture for deep reinforcement learning the goal is to reduce the expected generalisation error [26, §8.1.1] with respect to some class of similar environments. Although this class is typically difficult to specify and is often left implicit, in our case the class includes a range of visual logic puzzles involving spatial navigation, which can be solved without memory. (The bridge is the unique box both of whose colours appear three times on the board. However, this is not a reliable strategy for detecting bridges for an agent without memory, because once the agent has collected some of the keys on the board, some of the colours necessary to make this deduction may no longer be present.) A learning curriculum undermines this goal, by making our expectations of generalisation conditional on the provision of a suitable curriculum, whose existence for a given member of the problem class may not be clear in advance. The episode horizon serves as a de facto curriculum, since early in training it biases the distribution of experience rollouts towards the initial problems that an agent has to solve (e.g. learning to pick up the loose key). In order to avoid compromising our ability to expect generalisation to similar puzzles which do not admit such a useful curriculum, we have chosen not to employ an episode horizon. Fortunately, the relational agent performs well even without a curriculum on the original BoxWorld, as our results show.

Remark 5.2 .

Experiments were conducted either on the Google Cloud Platform with a single head node with 12 virtual CPUs and one NVIDIA Tesla P100 GPU and 192 additional virtual CPUs spread over two pre-emptible worker nodes, or on the University of Melbourne Nectar research cloud with a single head node with 12 virtual CPUs and two NVIDIA Tesla K80 GPUs, and 222 worker virtual CPUs.

6 Analysis

We analyse the simplicial agent, with two main goals: firstly, to establish that the agent has actually learned to use the 2-simplicial attention, and secondly to examine the hypothesis that the agent has learned a form of logical reasoning. The results of our analysis are positive in the first case but inconclusive in the second: while the agent has clearly learned to use both the 1-simplicial and 2-simplicial attention, we are unable to identify the structure in the attention as a homomorphic image of a logically correct explicit strategy, and so we leave as an open question whether the agent is performing logical reasoning according to the standard elaborated in Appendix B.

It will be convenient to organise episodes of bridge BoxWorld by their puzzle type, which is the tuple $(n, s, t)$ where $n$ is the solution length, $s$ is the bridge source and $t$ is the bridge target, with indices increasing with the distance from the Gem. For example, the puzzle type of the episode in Figure 2 can be read off in this way from its graph representation. Throughout this section simplicial agent means simplicial agent A of Figure 5.

6.1 Attention

We provide a preliminary analysis of the attention of the trained simplicial agent, with an aim to answer the following questions: which standard entities attend to which other standard entities? What do the virtual entities attend to? Is the 2-simplicial attention being used? Our answers are anecdotal, based on examining visualisations of rollouts; we leave a more systematic investigation of the agent’s strategy to future work.

The analysis of the agent’s attention is complicated by the fact that our convolutional layers (of which there are two) are not padded, so the number of entities processed by the Transformer blocks is smaller than the number of tiles on the original game board (which includes an extra column for the inventory). This means there is not a one-to-one correspondence between game board tiles and entities; the Transformer entities can instead be arranged on a smaller grid (information about this grid is passed to the Transformer blocks via the positional encoding). Nonetheless we found that for trained agents there is a strong relation between a tile and the Transformer entity at the corresponding position of this smaller grid. This correspondence is presumed in the following analysis, and in our visualisations.

Across our four trained simplicial agents, the roles of the virtual entities and heads vary: the following comments are all in the context of the best simplicial agent (simplicial agent A of Figure 5) but we observe similar patterns in the other trials.

6.1.1 1-simplicial attention of standard entities

Figure 6: Visualisation of 1-simplicial attention in the first Transformer block, between standard entities in heads one and two. The vertical axes on the second and third images are the query index, the horizontal axes are the key index.

The standard entities are indexed by their position on the entity grid and the virtual entities are indexed last. In the first iteration of the 2-simplicial Transformer block, the first 1-simplicial head appears to propagate information about the inventory. At the beginning of an episode the attention of each standard entity is distributed over the entities in the rightmost column; it concentrates sharply on the entity closest to the first inventory slot after the acquisition of the first loose key, and on the entity closest to the second inventory slot after the acquisition of the second loose key. The second 1-simplicial head seems to acquire the meaning described in [69], where tiles of the same colour attend to one another. A typical example is shown in Figure 6. The video of this episode is available online [15].

6.1.2 2-simplicial attention

The standard entities are updated using 2-simplices in the first iteration of the 2-simplicial Transformer block, but this is not interesting as initially the virtual entities are learned embedding vectors, containing no information about the current episode. So we restrict our analysis to the 2-simplicial attention in the second iteration of the Transformer block. In brief, we observe that the agent has learned to use the 2-simplicial attention to direct tensor products of value vectors to specific query entities. In a typical timestep most query entities attend strongly to a common pair of virtual entities (indicated in Figures 7–9), and we refer to this attention as generic. The top and bottom locks on the Gem, the player, and the entities associated to the inventory are often observed to have a non-generic 2-simplicial attention, and some of the relevant 2-simplices are drawn in the aforementioned figures.

Figure 7: The 1-simplicial attention of the virtual entities in the first iteration (first and second row, second and third column) and the 2-simplicial attention in the second iteration, in one step of an episode. The marked entities are the top lock on the Gem, the bottom lock on the Gem, and the player. Shown is a 2-simplex whose query entity is the bottom lock. In the visualisations of the 1-simplicial attention, the rows are query entities and the columns are key entities. In the visualisation of the 2-simplicial attention, the columns are query entities and the rows are key entity pairs in lexicographic order.
Figure 8: Visualisation of the 2-simplicial attention in the second Transformer block in one step of an episode. The marked entities are the top lock on the Gem, an entity associated with the inventory, and the lock directly below the player. Shown is a 2-simplex with a marked target entity.
Figure 9: Visualisation of the 2-simplicial attention in the second Transformer block in one step of an episode. The marked entities are an entity associated with the inventory and the player. Shown is a 2-simplex with a marked target entity.

To give more details we must first examine the content of the virtual entities after the first iteration, which is a function of the 1-simplicial attention of the virtual entities in the first iteration. In Figures 7, 8 and 9 we show these attention distributions multiplied by the pixels in the corresponding region of the original board, in the second and third columns of the second and third rows. (For visibility in print the 1-simplicial attention of the virtual entities in these figures has been sharpened, by multiplying the logits by a constant factor; the 1-simplicial and 2-simplicial attention of standard entities has not been sharpened. In this connection, we remark that in Figure 7 there is one entity whose unsharpened attention coefficient for the first virtual entity in the first head is more than one standard deviation above the mean, and there are two such entities for the second virtual entity and second head.) Let $\nu_1$ and $\nu_2$ denote the initial representations of the first and second virtual entities, before the first iteration, and let the index $\mu \in \{1, 2\}$ stand for a virtual entity. In the first iteration these representations are updated by (19) to

(23)

where the sum is over all entities, the coefficients in the first summand are the attention of the first 1-simplicial head and the coefficients in the second summand are the attention of the second 1-simplicial head. Writing $0$ for the zero vector in the respective head summands, this can be written as

(24)

For a query entity $i$ the vector propagated by the 2-simplicial part of the second iteration has the following terms:

(25)

Here the coefficients are the 2-simplicial attention, with logits the scalar triple products appearing in (18).

These coefficients form a column in our visualisations of the 2-simplicial attention, so in the situation of Figure 7 the attention of the bottom lock on the Gem concentrates on a single pair of virtual entities, and hence the output of the 2-simplicial head used to update the entity representation of the bottom lock on the Gem is approximately the image under $B$ of the tensor product of the corresponding pair of virtual entity value vectors. If we ignore the layer normalisation, feedforward network and skip connection in (24), then these virtual entity values are themselves weighted sums of value vectors of board entities, so that the output of the 2-simplicial head with this target is approximately

(26)

Following Boole [9] and Girard [25] it is natural to read the “product” (26) as a conjunction (consider the two entities together) and the sum in (25) as a disjunction. An additional layer normalisation is applied to this vector, and the result is concatenated with the incoming information for the query entity from the 1-simplicial attention, before all of this is passed through (17) to form the updated entity representation.

Given that the output of the 2-simplicial head is the only nontrivial difference between the simplicial and relational agent (with a Transformer depth of two, the 2-simplicial part of the first Transformer block only updates the standard entities with information from embedding vectors), the performance differences reported in Figure 3 suggest that this output is informative about avoiding bridges.

6.2 The plateau

In the training curves of the agents in Figure 4 and Figure 5 we observe a common plateau in the win rate. In Figure 10 we show the per-puzzle win rates of simplicial agent A and relational agent A. These graphs make clear that the transition of both agents onto the plateau is explained by their solving one particular puzzle type (and to a lesser degree by progress on closely related puzzle types). In Figure 10 and Figure 11 we give the per-puzzle win rates for a small sample of other puzzle types. Shown are the mean and standard deviation of 100 runs across various checkpoints of simplicial agent A and relational agent A.

Figure 10: Simplicial and relational agent win rates on a first sample of puzzle types.
Figure 11: Simplicial and relational agent win rates on a further sample of puzzle types.

7 Discussion

Motivated by the idea that abstract reasoning in humans is grounded in structural representations that are adapted from those evolved for spatial reasoning, we have presented a simplicial inductive bias. We have shown that in the context of a deep reinforcement learning environment with nontrivial logical structure, this bias is superior to a purely relational inductive bias. In this concluding section we briefly address some of the limitations of our work, and future directions.

Limitations. Our experiments involve only a small number of virtual entities, and a small number of iterations of the Transformer block: it is possible that for large numbers of virtual entities and iterations, our choices of layer normalisation are not optimal. Our aim was to test the viability of the simplicial Transformer starting with the minimal configuration, so we have also not tested multiple heads of 2-simplicial attention.

Deep reinforcement learning is notorious for poor reproducibility [31], and in an attempt to follow the emerging best practices we are releasing our agent and environment code, trained agent weights, and training notebooks [15].

Future directions. It is clear, using the general formulas for the unsigned scalar product, how to define an n-simplicial Transformer block, and this is arguably an idiomatic expression in the context of deep learning of the linear logic semantics of the $\otimes$ connective. It would be interesting to extend this to include other connectives, in environments encoding a larger fragment of linear logic proofs. However, at present this seems out of reach, because the complexity of higher-dimensional attention makes scaling to much larger environments impractical. We hope that some of the scaling work being done in the Transformer literature can be adapted to the simplicial Transformer; see for example [12].

Appendix A Clifford algebra

The volume of an $n$-simplex in $\mathbb{R}^n$ with vertices at $0, v_1, \dots, v_n$ is

$$\frac{1}{n!}\,\big|\det(v_1, \dots, v_n)\big|,$$

which is $\frac{1}{n!}$ times the volume of the $n$-dimensional parallelotope which shares $n$ edges with the $n$-simplex. In our applications the space $V$ of representations is high dimensional, but we wish to speak of the volume of $k$-simplices for $k < \dim V$ and use those volumes to define the coefficients of our simplicial attention. The theory of Clifford algebras [32] is one appropriate framework for such calculations.

Let $V$ be an inner product space with pairing $v \cdot w$. The Clifford algebra $\mathrm{Cl}(V)$ is the associative unital $\mathbb{R}$-algebra generated by the vectors $v \in V$ with relations

$$v w + w v = 2\,(v \cdot w)\,1.$$

The canonical $\mathbb{R}$-linear map $V \to \mathrm{Cl}(V)$ is injective, and since $v^2 = |v|^2$ in $\mathrm{Cl}(V)$, any nonzero vector is a unit in the Clifford algebra. While $\mathrm{Cl}(V)$ as an algebra is only $\mathbb{Z}_2$-graded, there is nonetheless a $\mathbb{Z}$-grading of the underlying vector space which can be defined as follows: let $e_1, \dots, e_d$ be an orthonormal basis of $V$; then the set

$$\{\, e_{i_1} e_{i_2} \cdots e_{i_k} \;:\; 1 \le i_1 < i_2 < \cdots < i_k \le d \,\} \cup \{1\}$$

is a basis for $\mathrm{Cl}(V)$. If we assign the basis element $e_{i_1} \cdots e_{i_k}$ the degree $k$ then this determines a $\mathbb{Z}$-grading of the underlying vector space of the Clifford algebra, which is easily checked to be independent of the choice of basis.

Definition A.1.

For $x \in \mathrm{Cl}(V)$, we write $\langle x \rangle_k$ for the homogeneous component of $x$ of degree $k$.

Example A.2.

Given $v, w \in V$ we have $v w + w v = 2\,(v \cdot w)$ and

$$v w = v \cdot w + \tfrac{1}{2}(v w - w v), \qquad (27)$$

where the first summand is the degree zero component $\langle v w \rangle_0$ and the second is the degree two component $\langle v w \rangle_2$.

There is an operation on elements of the Clifford algebra called reversion in geometric algebra [32, p.45] which arises as follows: the opposite algebra $\mathrm{Cl}(V)^{\mathrm{op}}$ admits an $\mathbb{R}$-linear map from $V$ which satisfies the defining relations of the Clifford algebra, and so by the universal property there is a unique morphism of algebras

$$(-)^{\sim} : \mathrm{Cl}(V) \longrightarrow \mathrm{Cl}(V)^{\mathrm{op}}$$

which restricts to the identity on $V$. Note that $(v_1 \cdots v_k)^{\sim} = v_k \cdots v_1$ for $v_1, \dots, v_k \in V$, and that reversion is homogeneous of degree zero with respect to the $\mathbb{Z}$-grading. Using this operation we can define the magnitude [32, p.46] of any element of the Clifford algebra.

Definition A.3.

The magnitude of $x \in \mathrm{Cl}(V)$ is $\|x\| = \sqrt{\langle \widetilde{x}\, x \rangle_0}$.

For vectors $v_1, \dots, v_k \in V$,

$$\| v_1 \cdots v_k \| = |v_1| \cdots |v_k| \qquad (28)$$

and in particular for $v \in V$ we have $\|v\| = |v|$.

Lemma A.4.

Set . Then for we have

Proof.

See [32, (1.33)]. ∎

Example A.5 .

For the lemma gives

and hence

Remark A.6.

Given vectors $v_1, \dots, v_k \in V$ the wedge product $v_1 \wedge \cdots \wedge v_k$ is an element in the exterior algebra $\Lambda V$. Using the chosen orthonormal basis $e_1, \dots, e_d$ we can identify the underlying vector space of $\mathrm{Cl}(V)$ with $\Lambda V$ and, using this identification (set $M$ to be the $k \times d$ matrix whose $j$th row contains the coefficients of $v_j$),

$$v_1 \wedge \cdots \wedge v_k = \sum_{1 \le i_1 < \cdots < i_k \le d} \Big( \sum_{\sigma \in S_k} \operatorname{sgn}(\sigma)\, M_{1\, i_{\sigma(1)}} \cdots M_{k\, i_{\sigma(k)}} \Big)\, e_{i_1} \wedge \cdots \wedge e_{i_k}$$

where $S_k$ is the permutation group on $k$ letters. That is, the top degree piece of $v_1 \cdots v_k$ in $\mathrm{Cl}(V)$ is always the wedge product. It is then easy to check that the squared magnitude of this wedge product is

$$\| v_1 \wedge \cdots \wedge v_k \|^2 = \sum_{1 \le i_1 < \cdots < i_k \le d} \Big( \sum_{\sigma \in S_k} \operatorname{sgn}(\sigma)\, M_{1\, i_{\sigma(1)}} \cdots M_{k\, i_{\sigma(k)}} \Big)^2. \qquad (29)$$

The term in the innermost bracket is the determinant of the submatrix of $M$ with columns $i_1, \dots, i_k$, and in the special case where $k = d$ we see that the squared magnitude is just the square of the determinant of the matrix $M$.

The wedge product of $k$ vectors in $V$ can be thought of as an oriented $k$-simplex with a vertex at the origin, and the magnitude of this wedge product in the Clifford algebra computes its volume (up to the factor of $k!$ relating a simplex to a parallelotope).

Definition A.7.

The volume of a $k$-simplex in $V$ with vertices $v_0, v_1, \dots, v_k$ is

$$\frac{1}{k!}\,\big\| (v_1 - v_0) \wedge \cdots \wedge (v_k - v_0) \big\|.$$
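A NumPy sketch of this volume computation, using the Gram determinant form of the squared magnitude in (29); the helper function and test cases are ours.

import numpy as np
from math import factorial

def simplex_volume(vertices):
    """Volume of the k-simplex with the given (k+1) vertices in R^d, computed
    as (1/k!) * sqrt(det Gram) of the edge vectors v_i - v_0."""
    V = np.asarray(vertices, dtype=float)
    edges = V[1:] - V[0]                 # k edge vectors
    gram = edges @ edges.T               # k x k Gram matrix of pairwise dot products
    k = edges.shape[0]
    return np.sqrt(max(np.linalg.det(gram), 0.0)) / factorial(k)

# Unit right triangle embedded in R^5: area 1/2
tri = np.zeros((3, 5))
tri[1, 0] = 1.0
tri[2, 1] = 1.0
print(simplex_volume(tri))    # 0.5

# Tetrahedron spanned by e1, e2, e3 in R^5: volume 1/6
tet = np.zeros((4, 5))
tet[1, 0] = tet[2, 1] = tet[3, 2] = 1.0
print(simplex_volume(tet))    # 0.1666...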