 # Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving

We incorporate Tensor-Product Representations within the Transformer in order to better support the explicit representation of relation structure. Our Tensor-Product Transformer (TP-Transformer) sets a new state of the art on the recently-introduced Mathematics Dataset containing 56 categories of free-form math word-problems. The essential component of the model is a novel attention mechanism, called TP-Attention, which explicitly encodes the relations between each Transformer cell and the other cells from which values have been retrieved by attention. TP-Attention goes beyond linear combination of retrieved values, strengthening representation-building and resolving ambiguities introduced by multiple layers of standard attention. The TP-Transformer's attention maps give better insights into how it is capable of solving the Mathematics Dataset's challenging problems. Pretrained models and code will be made available after publication.

## Authors


## 1 Introduction

In this paper we propose a variation of the Transformer (Vaswani et al., 2017) that is designed to allow it to better incorporate structure into its representations. We test the proposal on a task where structured representations are expected to be particularly helpful: math word-problem solving, where, among other things, correctly parsing expressions and compositionally evaluating them is crucial. Given as input a free-form math question in the form of a character sequence like Let r(g) be the second derivative of 2*g**3/3 - 21*g**2/2 + 10*g. Let z be r(7). Factor -z*s + 6 - 9*s**2 + 0*s + 6*s**2., the model must produce an answer matching the specified target character-sequence -(s + 3)*(3*s - 2) exactly. Our proposed model is trained end-to-end and infers the correct answer for novel examples without any task-specific structural biases.

We begin by viewing the Transformer as a kind of Graph Neural Network (e.g., Gori et al., 2005). For concreteness, consider the encoder component of a Transformer with $H$ heads. When the $h$-th head of a cell $t$ of layer $l$ issues a query and as a result concentrates its self-attention distribution on another cell $t'$ in layer $l$, we can view these two cells as joined by an edge in an information-flow graph: the information content at $t'$ in effect passes via this edge to affect the state of $t$. The strength of this attention can be viewed as a weight on this edge, and the index $h$ of the head can be viewed as a label. Thus, each layer of the Transformer can be viewed as a complete, directed, weighted, labeled graph. Prior NLP work has interpreted certain edges of these graphs in terms of linguistic relations (Sec. 7), and we wish to enrich the relation structure of these graphs to better support the explicit representation of relations within the Transformer.

Here we propose to replace each of the discrete edge labels $1, \dots, H$ with a relation vector: we create a bona fide representational space for the relations being learned by the Transformer. This makes it possible for the hidden representation at each cell to approximate the vector embedding of a symbolic structure built from the relations generated by that cell. This embedding is a Tensor-Product Representation (TPR; Smolensky, 1990) in an end-to-end-differentiable TPR system (Schlag & Schmidhuber, 2018; Schmidhuber, 1992) that learns “internal spotlights of attention” (Schmidhuber, 1993). TPRs provide a general method for embedding symbol structures in vector spaces. TPRs support compositional processing by directly encoding constituent structure: the representation of a structure is the sum of the representations of its constituents. The representation of each constituent is built compositionally from two vectors: one vector that embeds the content of the constituent, the ‘filler’ (here, the vector returned by attention), and a second vector that embeds the structural role it fills (here, a relation conceptually labeling an edge of the attention graph). The vector that embeds a filler and the vector that embeds the role it fills are bound together by the tensor product to form the tensor that embeds the constituent that they together define. (The tensor product operation, when the role-embedding vectors are linearly independent, enables the sum of constituents representing the structure as a whole to be uniquely decomposable back into individual pairs of roles and their fillers, if necessary.) The relations here, and the structures they define, are learned unsupervised by the Transformer in service of a task; post-hoc analysis is then required to interpret those roles.
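The binding and unbinding operations of a TPR can be illustrated with a small NumPy sketch. It is not taken from the model: the toy sizes and the random fillers are assumptions, and the roles are chosen orthonormal (a special case of linear independence) so that unbinding reduces to a single matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 3 constituents, 4-dim fillers, 3-dim roles.
fillers = rng.normal(size=(3, 4))

# Orthonormal role vectors (the rows of an orthogonal matrix).
roles, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# Bind each filler to its role with the outer (tensor) product, then sum.
tpr = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# Unbind: contracting the TPR with a role vector returns that role's filler.
recovered = tpr @ roles[1]
print(np.allclose(recovered, fillers[1]))  # True
```

With roles that are linearly independent but not orthonormal, unbinding would instead use the dual basis of the role vectors.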

In the new model, the TP-Transformer, each head of each cell generates a key-, value- and query-vector, as in the Transformer, but additionally generates a role-vector (which we refer to in some contexts as a ‘relation vector’). The query is interpreted as seeking the appropriate filler for that role (or equivalently, the appropriate string-location for fulfilling that relation). Each head binds that filler to its role via the tensor product (or some contraction of it), and these filler/role bindings are summed to form the TPR of a structure with $H$ constituents (details in Sec. 2).

An interpretation of an actual learned relation illustrates this (see Fig. 4 in Sec. 5.2). One head of our trained model can be interpreted as partially encoding the relation second-argument-of. The top-layer cell dominating an input digit seeks the operator of which the digit is in the second-argument role. That cell generates a vector $r_t$ signifying this relation, and retrieves a value vector $v_{t'}$ describing the operator from position $t'$ that stands in this relation. The result of this head’s attention is then the binding of filler $v_{t'}$ to role $r_t$; this binding is added to the bindings resulting from the cell’s other attention heads.

On the Mathematics Dataset (Sec. 3), the new model sets a new state of the art for the overall accuracy (Sec. 4), and for all the individual-problem-type module accuracies (Fig. 2). Initial results of interpreting the learned roles for the arithmetic-problem module show that they include a good approximation to the second-argument role of the division operator and that they distinguish between numbers in the numerator and denominator roles (Sec. 5).

More generally, it is shown that Multi-Head Attention layers not only capture a subspace of the attended cell but capture nearly the full information content (Sec. 6.1). An argument is provided that multiple layers of standard attention suffer from the binding problem, and it is shown theoretically how the proposed TP-Attention avoids such ambiguity (Sec. 6.2). The paper closes with a discussion of related work (Sec. 7) and a conclusion (Sec. 8).

## 2 The TP-Transformer

The TP-Transformer’s encoder network, like the Transformer’s encoder (Vaswani et al., 2017), can be described as a 2-dimensional lattice of cells $(t, l)$ where $t = 1, \dots, T$ are the sequence elements of the input and $l = 1, \dots, L$ are the layer indices with $l = 0$ as the embedding layer. All cells share the same topology and the cells of the same layer share the same weights. More specifically, each cell consists of an initial layer normalization (LN) followed by a TP-Multi-Head Attention (TPMHA) sub-layer followed by a fully-connected feed-forward (FF) sub-layer. Each sub-layer is followed by layer normalization (LN) and by a residual connection (as in the original Transformer; Eq. 1). Our cell structure follows directly from the official TensorFlow source code by Vaswani et al. (2017) but with regular Multi-Head Attention replaced by our TPMHA layer.

The input into cell $(t, l)$ is the output of cell $(t, l-1)$ and doesn’t depend on the state of any other cells of the same layer, which allows a layer’s outputs to be computed in parallel.

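$$
\begin{aligned}
h_{t,l} &= z_{t,l} + \mathrm{TPMHA}\big(\mathrm{LN}(z_{t,l}), \mathrm{LN}(z_{1:T,l})\big) \\
z_{t,l+1} &= \mathrm{LN}\big(h_{t,l} + \mathrm{FF}(\mathrm{LN}(h_{t,l}))\big)
\end{aligned}
\tag{1}
$$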

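We represent the symbols of the input string as one-hot vectors $x_1, \dots, x_T \in \mathbb{R}^{d_v}$ where $d_v$ is the size of the vocabulary and the respective columns of the matrix $E \in \mathbb{R}^{d_z \times d_v}$ are the embedding vectors of those symbols. We also include a positional representation $p_t$ using the same sinusoidal encoding schema introduced by Vaswani et al. (2017). The input of the first layer is $z_{t,0}$:

$$
\begin{aligned}
e_t &= E x_t \sqrt{d_z} + p_t \\
r_t &= W^{(p)} e_t + b^{(p)} \\
z_{t,0} &= e_t \odot r_t
\end{aligned}
\tag{2}
$$

where $W^{(p)} \in \mathbb{R}^{d_z \times d_z}$, $b^{(p)} \in \mathbb{R}^{d_z}$, $r_t$ is a position- and symbol-dependent role representation, and $\odot$ is elementwise multiplication (a contraction of the tensor product: see Sec. 2.1).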

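### 2.1 TP-Multi-Head Attention

The TPMHA layer of the encoder consists of $H$ heads that can be applied in parallel. Every head $h$, $1 \le h \le H$, applies separate affine transformations $W^{(k)}_{h,l}, W^{(v)}_{h,l}, W^{(q)}_{h,l}, W^{(r)}_{h,l} \in \mathbb{R}^{d_k \times d_z}$, $b^{(k)}_{h,l}, b^{(v)}_{h,l}, b^{(q)}_{h,l}, b^{(r)}_{h,l} \in \mathbb{R}^{d_k}$ to produce key, value, query, and relation vectors from the hidden state $z_{t,l}$, where $d_k = d_z / H$:

$$
\begin{aligned}
k^h_{t,l} &= W^{(k)}_{h,l} z_{t,l} + b^{(k)}_{h,l} \\
v^h_{t,l} &= W^{(v)}_{h,l} z_{t,l} + b^{(v)}_{h,l} \\
q^h_{t,l} &= W^{(q)}_{h,l} z_{t,l} + b^{(q)}_{h,l} \\
r^h_{t,l} &= W^{(r)}_{h,l} z_{t,l} + b^{(r)}_{h,l}
\end{aligned}
\tag{3}
$$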

The filler of attention head $h$ at cell $(t, l)$ is

$$
\bar{v}^h_{t,l} = \sum_{i=1}^{T} v^h_{i,l}\, \alpha^{h,i}_{t,l}, \tag{4}
$$

i.e., a weighted sum of all values of the same layer and attention head (see Fig. 1). Here $\alpha^{h,i}_{t,l}$ is a continuous degree of match given by the softmax of the dot product between the query vector at position $t$ and the key vector at position $i$:

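$$
\alpha^{h,i}_{t,l} = \frac{\exp\!\big(q^h_{t,l} \cdot k^h_{i,l} \tfrac{1}{\sqrt{d_k}}\big)}{\sum_{i'=1}^{T} \exp\!\big(q^h_{t,l} \cdot k^h_{i',l} \tfrac{1}{\sqrt{d_k}}\big)} \tag{5}
$$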

The scale factor $\frac{1}{\sqrt{d_k}}$ can be motivated as a variance-reducing factor under the assumption that the elements of $q^h_{t,l}$ and $k^h_{i,l}$ are uncorrelated variables with mean 0 and variance 1, in order to initially keep the values of the softmax in a region with better gradients.
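The variance argument can be checked numerically. The sketch below is only an illustration under the stated assumption (independent query and key elements with mean 0 and variance 1); the head dimension and sample count are arbitrary choices. The unscaled dot product then has variance close to $d_k$, and dividing by $\sqrt{d_k}$ as in Eq. 5 brings it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64        # hypothetical head dimension
n = 100_000     # number of sampled query/key pairs

q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

dots = np.einsum("nd,nd->n", q, k)   # unscaled dot products, one per pair
scaled = dots / np.sqrt(d_k)          # dot products scaled as in Eq. 5

# The first variance is close to d_k, the second close to 1.
print(dots.var(), scaled.var())
```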

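Finally, we bind the filler $\bar{v}^h_{t,l}$ with our relation vector $r^h_{t,l}$, followed by an affine transformation $W^{(o)}_{h,l} \in \mathbb{R}^{d_z \times d_k}$, $b^{(o)}_{h,l} \in \mathbb{R}^{d_z}$ before it is summed up with the other heads’ bindings to form the TPR of a structure with $H$ constituents: this is the output of the TPMHA layer.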

$$
\mathrm{TPMHA}(z_{t,l}, z_{1:T,l}) = \sum_{h} \Big[ W^{(o)}_{h,l} \big( \bar{v}^h_{t,l} \odot r^h_{t,l} \big) + b^{(o)}_{h,l} \Big] \tag{6}
$$

Note that, in this binding, to control dimensionality, we use a contraction of the tensor product, pointwise multiplication $\odot$: this is the diagonal of the tensor product. For discussion, see the Appendix.

Figure 1: A simplified illustration of our TP-Attention mechanism for one head at position t in layer l. The main difference from standard Attention is the additional role representation that is element-wise multiplied with the filler/value representation.

It is worth noting that the TPMHA layer returns a vector that is quadratic in the inputs $z_{i,l}$ to the layer: the vectors $v^h_{i,l}$ that are linearly combined to form $\bar{v}^h_{t,l}$ (Eq. 4), and $r^h_{t,l}$, are both linear in the $z_{i,l}$ (Eq. 3), and they are multiplied together to form the output of TPMHA (Eq. 6). This means that, unlike regular attention, TPMHA can increase, over successive layers, the polynomial degree of its representations as a function of the original input to the Transformer. Although it is true that the feed-forward layer following attention (Sec. 2.2) introduces its own non-linearity even in the standard Transformer, in the TP-Transformer the attention mechanism itself goes beyond mere linear re-combination of vectors from the previous layer. This provides further potential for the construction of increasingly abstract representations in higher layers.
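The computation of Eqs. 3 to 6 can be sketched for a single layer in NumPy. This is a minimal illustration and not the authors' implementation: the parameter names (`Wk`, `Wr`, `Wo`, ...), the helper `init_head`, the toy sizes and the random initialization are all assumptions made for the example, and layer normalization, masking and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tp_multi_head_attention(z, heads):
    """z: (T, d_z) hidden states of one layer; returns an array of shape (T, d_z)."""
    out = np.zeros_like(z)
    for p in heads:                                   # one parameter dict per head
        k = z @ p["Wk"].T + p["bk"]                   # keys    (T, d_k)
        v = z @ p["Wv"].T + p["bv"]                   # values  (T, d_k)
        q = z @ p["Wq"].T + p["bq"]                   # queries (T, d_k)
        r = z @ p["Wr"].T + p["br"]                   # roles   (T, d_k)
        alpha = softmax(q @ k.T / np.sqrt(k.shape[1]))  # (T, T), rows sum to 1
        filler = alpha @ v                            # attention-weighted values
        bound = filler * r                            # Hadamard binding of filler and role
        out += bound @ p["Wo"].T + p["bo"]            # map to d_z and sum over heads
    return out

def init_head(rng, d_z, d_k):
    """Hypothetical random initialization for one head."""
    p = {}
    for name in ("k", "v", "q", "r"):
        p["W" + name] = rng.normal(scale=d_z ** -0.5, size=(d_k, d_z))
        p["b" + name] = np.zeros(d_k)
    p["Wo"] = rng.normal(scale=d_k ** -0.5, size=(d_z, d_k))
    p["bo"] = np.zeros(d_z)
    return p

rng = np.random.default_rng(0)
T, d_z, H = 5, 16, 4
heads = [init_head(rng, d_z, d_z // H) for _ in range(H)]
z = rng.normal(size=(T, d_z))
y = tp_multi_head_attention(z, heads)
print(y.shape)  # (5, 16)
```

Replacing the line `bound = filler * r` with `bound = filler` gives standard multi-head attention, which makes the single extra multiplication the only structural difference in this sketch.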

### 2.2 Feed-forward Layer

The feed-forward layer of a cell consists of an affine transformation followed by a ReLU activation and a second affine transformation:

$$
\mathrm{FF}(x) = W^{(g)}_l\, \mathrm{ReLU}\big(W^{(f)}_l x + b^{(f)}_l\big) + b^{(g)}_l \tag{7}
$$

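Here, $W^{(f)}_l \in \mathbb{R}^{d_f \times d_z}$, $b^{(f)}_l \in \mathbb{R}^{d_f}$, $W^{(g)}_l \in \mathbb{R}^{d_z \times d_f}$, $b^{(g)}_l \in \mathbb{R}^{d_z}$, and $x$ is the function’s argument. As in previous work, we set $d_f = 4 d_z$.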

### 2.3 The Decoder Network

The decoder network is a separate network with a similar structure to the encoder that takes the hidden states of the encoder and auto-regressively generates the output sequence. In contrast to the encoder network, the cells of the decoder contain two TPMHA layers and one feed-forward layer. We designed our decoder network analogously to Vaswani et al. (2017), where the first attention layer attends over the masked decoder states while the second attention layer attends over the final encoder states. During training, the decoder network receives the shifted targets (teacher-forcing) while during inference we use the previous symbol with highest probability (greedy-decoding). The final symbol probability distribution is given by

$$
\hat{y}_{\hat{t}} = \mathrm{softmax}\big(E^{\top} \hat{z}_{\hat{t},L}\big) \tag{8}
$$

where $\hat{z}_{\hat{t},L}$ is the hidden state of the last layer of the decoder at decoding step $\hat{t}$ of the output sequence and $E$ is the shared symbol embedding of the encoder and decoder.
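Greedy decoding as described above can be sketched as a short loop. The function `next_logits` is a hypothetical stand-in for the decoder (it maps the symbols generated so far to unnormalized scores over the vocabulary), and the toy decoder below only exists to make the example runnable.

```python
import numpy as np

def greedy_decode(next_logits, sos, eos, max_len=50):
    """Generate symbols one at a time, always taking the most probable one."""
    out = [sos]
    for _ in range(max_len):
        logits = next_logits(out)
        nxt = int(np.argmax(logits))   # softmax is monotonic, so argmax of the logits suffices
        out.append(nxt)
        if nxt == eos:
            break
    return out[1:]                     # drop the start symbol

# Toy stand-in for a decoder: emits symbol 3, then 4, then the end symbol 1.
script = {1: 3, 2: 4, 3: 1}

def toy_decoder(prefix):
    logits = np.zeros(5)
    logits[script[len(prefix)]] = 1.0
    return logits

print(greedy_decode(toy_decoder, sos=0, eos=1))  # [3, 4, 1]
```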

## 3 The Mathematics Dataset

The Mathematics Dataset (Saxton et al., 2019) is a large collection of math problems of various types, including algebra, arithmetic, calculus, numerical comparison, measurement, numerical factorization, and probability. Its main goal is to investigate the capability of neural networks to reason formally. Each problem is structured as a character-level sequence-to-sequence problem. The input sequence is a free-form math question or command like What is the first derivative of 13*a**2 - 627434*a + 11914106? from which our model correctly predicts the target sequence 26*a - 627434. Another example from a different module is Calculate 66.6*12.14. which has 808.524 as its target sequence.

The dataset is structured into 56 modules which cover a broad spectrum of mathematics up to university level. It is procedurally generated and comes with 2 million pre-generated training samples per module. The authors provide an interpolation dataset for every module, as well as a few extrapolation datasets as an additional measure of algebraic generalization.

We merge the different training splits train-easy, train-medium, and train-hard from all modules into one big training dataset of 120 million unique samples. From this dataset we extract a character-level vocabulary of 72 symbols, including start-of-sentence, end-of-sentence, and padding symbols. (Note that Saxton et al. (2019) report a vocabulary size of 95, but this figure encompasses characters that never appear in the pre-generated training and test data.)
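Extracting a character-level vocabulary of this kind can be sketched in a few lines of plain Python. The special-symbol names `<pad>`, `<sos>` and `<eos>` are placeholders chosen for the example, and the two sample strings are taken from the examples quoted in this section rather than from the full dataset.

```python
def build_vocab(lines, specials=("<pad>", "<sos>", "<eos>")):
    """Map every special symbol and every character seen in `lines` to an index."""
    chars = sorted({ch for line in lines for ch in line})
    return {sym: idx for idx, sym in enumerate(list(specials) + chars)}

def encode(text, vocab):
    """Wrap the character indices of `text` in start and end symbols."""
    return [vocab["<sos>"]] + [vocab[ch] for ch in text] + [vocab["<eos>"]]

samples = ["Calculate 66.6*12.14.", "808.524"]
vocab = build_vocab(samples)
ids = encode("808.524", vocab)
print(len(vocab), len(ids))  # 20 9
```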

## 4 Experimental results

We evaluate our trained model on the concatenated interpolation and extrapolation datasets of the pre-generated files, achieving a new state of the art: see Table 1. For a more detailed comparison, Fig. 2 shows the interpolation and extrapolation performance of every module separately. The TP-Transformer matches or out-performs the Transformer in every module but one (probability__swr_p_sequence). Our model never quite converged, and was stopped prematurely after 1.7 million steps. We trained our model on one server with 4 V100 Nvidia GPUs for 25 days.

### 4.1 Implementation Details

We initialize the symbol embedding matrix from , from , and all other matrices using the Xavier uniform initialization as introduced by Glorot & Bengio (2010). The model parameters are set to . We were not able to train the TP-Transformer, nor the regular Transformer, using the learning rate and gradient clipping scheme described by Saxton et al. (2019). Instead we proceed as follows: the gradients are computed using PyTorch’s Autograd engine and their norm is clipped at 0.1. The optimizer we use is also Adam, but with a smaller learning rate. We train with a batch size of 1024 up to 1.7 million steps.
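Clipping the gradient norm at 0.1 can be illustrated with a framework-independent NumPy sketch. It assumes that the norm is the global L2 norm taken over all parameter gradients, which is the usual convention (PyTorch offers `torch.nn.utils.clip_grad_norm_` for this purpose); the text does not state which variant was used, and the gradient values below are made up.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=0.1):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # leave small gradients untouched
    return [g * scale for g in grads], total

grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]   # toy gradients
clipped, norm_before = clip_by_global_norm(grads)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)   # about 9.165 before, about 0.1 after
```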

## 5 Interpreting the learned structure

We report initial results of analyzing the learned structure of the encoder network’s last layer from our 700k-step TP-Transformer.

### 5.1 Interpreting the learned roles

To this end, we sample 128 problems from the interpolation dataset of the arithmetic__mixed module and collect the role vectors from a randomly chosen head. We use $k$-means with to cluster the role vectors from different samples and different time steps of the final layer of the encoder. Interestingly, we find separate clusters for digits in the numerator and denominator of fractions. When there is a fraction of fractions we can observe that these assignments are placed such that the second fraction reverses, arguably simplifying the division of fractions into a multiplication of fractions (see Fig. 3).

Figure 2: The accuracies of our implementation of the Transformer (700k steps) and the TP-Transformer (700k and 1.7M steps) for every module of the Mathematics Dataset.

Figure 3: Samples of correctly processed problems from the arithmetic__mixed module. ‘#’ and ‘%’ are the start- and end-of-sentence symbols. The colored squares indicate the k-means cluster of the role-vector assigned by one head in the final layer in that position. Blue and gold rectangles respectively highlight numerator and denominator roles. They were discovered manually. Note how their placement is correctly swapped in rows 2, 3, and 4, where a number in the denominator of a denominator is treated as if in a numerator. Role-cluster 9 corresponds to the role ones-digit-of-a-numerator-factor, and 6 to ones-digit-of-a-denominator-factor; other such roles are also evident.

Figure 4: TP-Transformer attention maps for three examples as described in section 5.2.
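The clustering step can be sketched with a plain NumPy implementation of Lloyd's k-means. The role vectors here are synthetic (two well-separated blobs standing in for two roles), and the number of clusters, the dimensionality and the helper names are assumptions made for the example; they are not the settings used in the analysis above.

```python
import numpy as np

def assign(x, centers):
    """Index of the nearest center for every row of x."""
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # copy of k random points
    for _ in range(iters):
        labels = assign(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members):               # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return assign(x, centers), centers

rng = np.random.default_rng(1)
# Synthetic stand-ins for role vectors collected from one head.
role_vectors = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 8)),
    rng.normal(5.0, 0.1, size=(50, 8)),
])
labels, centers = kmeans(role_vectors, k=2)
print(labels[:3], labels[-3:])
```

Each position in an input string can then be colored by the cluster label of its role vector, as in Fig. 3.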

### 5.2 Interpreting the attention maps

In Fig. 4 we display three separate attention weight vectors of one head of the last TP-Transformer layer of the encoder. Gold boxes are overlaid to highlight most-relevant portions. The row above the attention mask indicates the symbols that give information to the symbol in the bottom row. In each case, they give to ‘/’. Seen most simply in the first example, this attention can be interpreted as encoding a relation second-argument-of holding between the attended digits and the ‘/’ operator. The second and third examples show that several numerals in the denominator can participate in this relation. The third display shows how a numerator-numeral (-297) intervening between two denominator-numerals is skipped for this relation.

## 6 Insights and deficits of multiple multi-head attention layers

### 6.1 Multi-Head Attention subspaces capture virtually all information

It was claimed by Vaswani et al. (2017) that Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. In this section, we show that in our trained models, an individual attention head does not access merely a subset of the information in the attended cell but instead captures nearly the full information content.

Let us consider a toy example where the attention layer of $z_{t,l}$ only attends to $z_{t',l}$. In this setting, the post-attention representation simplifies and becomes

$$
h_{t,l} = z_{t,l} + o_l\big(v_l(\mathrm{LN}(z_{t',l}))\big) \tag{9}
$$

where $o_l$ and $v_l$ are the respective affine maps (see Sec. 2.1). Note that even though $v_l$ is a projection into an 8 times smaller vector space, it remains to be seen whether the hidden state loses information about $z_{t',l}$. We empirically test to what extent the trained Transformer and TP-Transformer lose information. To this end, we randomly select samples and extract the hidden state of the last layer of the encoder $z_{t,6}$, as well as the value representation $v_h(z_{t,6})$ for every head. We then train an affine model to reconstruct $z_{t,6}$ from $v_h(z_{t,6})$, the value vector of the single head $h$:

$$
\hat{z}_{t,6} = W_h\, v_h(z_{t,6}) + b_h, \qquad e = \frac{1}{n}\big(\hat{z}_{t,6} - z_{t,6}\big)^2 \tag{10}
$$

For both trained models, the TP-Transformer and the regular Transformer, the mean squared error averaged across all heads is only ~0.017 and ~0.009 respectively. This indicates that the attention mechanism incorporates not just a subspace of the states it attends to, but affine transformations of those states that preserve nearly the full information content.
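The reconstruction probe can be sketched as an ordinary least-squares fit. Everything below is synthetic and says nothing about the trained models: the sizes, the random value projection `W_v` and the two kinds of hidden states are assumptions, chosen only to show how the probe error behaves when the states fill all $d_z$ directions and when they are confined to as few directions as the value head has. The error is measured on the same data the probe is fitted on.

```python
import numpy as np

def affine_probe_mse(v, z):
    """Fit z ~ v @ W + b by least squares and return the mean squared error."""
    v1 = np.hstack([v, np.ones((len(v), 1))])        # append a bias column
    coef, *_ = np.linalg.lstsq(v1, z, rcond=None)
    return float(np.mean((v1 @ coef - z) ** 2))

rng = np.random.default_rng(0)
n, d_z, d_k = 1000, 32, 4                             # hypothetical sizes, d_k = d_z / 8
W_v = rng.normal(size=(d_z, d_k)) / np.sqrt(d_z)      # stand-in value projection

z_full = rng.normal(size=(n, d_z))                    # states spread over all d_z directions
z_low = rng.normal(size=(n, d_k)) @ rng.normal(size=(d_k, d_z))  # states confined to d_k directions

mse_full = affine_probe_mse(z_full @ W_v, z_full)
mse_low = affine_probe_mse(z_low @ W_v, z_low)
print(mse_full, mse_low)   # large error in the first case, near zero in the second
```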

### 6.2 The Binding Problem of stacked Attention layers

The binding problem refers to the problem of binding features together into objects while keeping them separated from other objects. It has been studied in the context of theoretical neuroscience (von der Malsburg, 1981, 1994) but also with regard to connectionist machine learning models (Hinton et al., 1984). The purpose of a binding mechanism is to enable the fully distributed representation of symbolic structure (like a hierarchy of features), which has recently resurfaced as an important direction for neural network research (Lake & Baroni, 2017; Bahdanau et al., 2018; van Steenkiste et al., 2019; Palangi et al., 2017; Tang et al., 2018).

In this section, we describe how the standard attention mechanism is ill suited to capture complex nested representations, and we provide an intuitive understanding of the benefit of our TP-Attention. We understand the attention layer of a cell as the means by which the subject (the cell state) queries all other cells for an object. We then show how a hierarchical representation of multiple queries becomes ambiguous in multiple standard attention layers.

Consider the string (a/b)/(c/d). A good neural representation captures the hierarchical structure of the string such that it will not be confused with the similar-looking but structurally different string (a/d)/(c/b). Our TP-Attention makes use of a binding mechanism in order to explicitly support complex structural relations by binding together the object representations receiving high attention with a subject-specific role representation. Let us continue with a more technical example. Consider a simplified Transformer network where every cell consists only of a single-head attention layer with a residual connection (no feed-forward layer or layer normalization), and let us assume no bias terms in the maps $o_l$ and $v_l$ introduced in the previous section (Eq. 9). In this setting, assume that $z_{a,l}$ only attends to $z_{b,l}$, and $z_{c,l}$ only attends to $z_{d,l}$, where $a, b, c, d$ are distinct positions of the input sequence. In this case

$$
\begin{aligned}
z_{a,l+1} &= z_{a,l} + o_l(v_l(z_{b,l})) \\
z_{c,l+1} &= z_{c,l} + o_l(v_l(z_{d,l}))
\end{aligned}
\tag{11}
$$

Suppose now that, for hierarchical grouping, a cell $e$ of the next layer attends to both $z_{a,l+1}$ and $z_{c,l+1}$ (equally, each with attention weight $\frac{1}{2}$). This results in the representation

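$$
\begin{aligned}
z_{e,l+2} &= z_{e,l+1} + o_{l+1}\big(v_{l+1}(z_{a,l+1} + z_{c,l+1})\big)/2 \\
&= z_{e,l+1} + o_{l+1}\big(v_{l+1}(z_{a,l} + z_{c,l} + o_l(v_l(z_{b,l})) + o_l(v_l(z_{d,l})))\big)/2
\end{aligned}
\tag{12}
$$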

Note that the final representation is ambiguous in the sense that it is unclear by looking only at Eq. 12 whether $z_{a,l}$ has picked $z_{b,l}$ or $z_{d,l}$. Either scenario would have led to the same outcome, which means that the network would not be able to distinguish between these two different structures (as in confusing (a/b)/(c/d) with (a/d)/(c/b)). In order to resolve this ambiguity, the standard Transformer must recruit other attention heads or find suitable non-linear maps in between attention layers, but it remains uncertain how the network might achieve a clean separation.

Our TP-Attention mechanism, on the other hand, specifically removes this ambiguity. Now Eqs. 11 and 12 become:

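$$
\begin{aligned}
z_{a,l+1} &= z_{a,l} + o_l\big(v_l(z_{b,l}) \odot r_{a,l}\big) \\
z_{c,l+1} &= z_{c,l} + o_l\big(v_l(z_{d,l}) \odot r_{c,l}\big) \\
z_{e,l+2} &= z_{e,l+1} + o_{l+1}\big(v_{l+1}(z_{a,l} + z_{c,l} + o_l(v_l(z_{b,l}) \odot r_{a,l}) + o_l(v_l(z_{d,l}) \odot r_{c,l}))\big)/2
\end{aligned}
\tag{13}
$$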

Note that the final representation is not ambiguous anymore. Binding the filler symbols (our objects) with a subject-specific role representation as described in Eq. 6 breaks the structural symmetry we had with regular attention. It is now simple for the network to specifically distinguish the two different structures.
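The ambiguity and its resolution can be reproduced numerically in the simplified setting used above (single head, linear maps without bias, no layer normalization). The sketch uses random vectors and matrices as stand-ins for the cell states, role vectors and maps, so all concrete values are assumptions; it compares the quantity $z_{a,l+1} + z_{c,l+1}$ that the next layer receives for the two structures.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
z = {name: rng.normal(size=d) for name in "abcd"}   # cell states at layer l
r = {name: rng.normal(size=d) for name in "ac"}     # role vectors of the querying cells
V = rng.normal(size=(d, d))                          # value map v_l (no bias)
O = rng.normal(size=(d, d))                          # output map o_l (no bias)

def pooled(pairs, use_roles):
    """Return z_{a,l+1} + z_{c,l+1} when each subject attends to exactly one object."""
    total = np.zeros(d)
    for subj, obj in pairs:
        filler = V @ z[obj]
        if use_roles:
            filler = filler * r[subj]    # bind the filler to the subject's role
        total += z[subj] + O @ filler
    return total

s1 = [("a", "b"), ("c", "d")]   # structure (a/b)/(c/d)
s2 = [("a", "d"), ("c", "b")]   # structure (a/d)/(c/b)

same_std = np.allclose(pooled(s1, False), pooled(s2, False))
same_tp = np.allclose(pooled(s1, True), pooled(s2, True))
print(same_std, same_tp)  # True False
```

With standard attention the two structures yield the same pooled vector, so no later map can separate them; with role binding the pooled vectors differ.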

## 7 Related work

Several recent studies have shown that the Transformer-based language model BERT (Devlin et al., 2018) captures linguistic relations such as those expressed in dependency-parse trees. This was shown for BERT’s hidden activation states (Hewitt & Manning, 2019; Tenney et al., 2019) and, most directly related to the present work, for the graph implicit in BERT’s attention weights (Coenen et al., 2019; Lin et al., 2019). Future work applying the TP-Transformer to language tasks (like those on which BERT is trained) will enable us to study the connection between the explicit relations the TP-Transformer learns and the implicit relations that have been extracted from BERT.

## 8 Conclusion

We have introduced the TP-Transformer, which enables the powerful Transformer architecture to learn to explicitly encode structural relations using Tensor-Product Representations. On the novel and challenging Mathematics Dataset, TP-Transformer beats the previously published state of the art by 8.24%. Our initial analysis of this model’s final layer suggests that the TP-Transformer naturally learns to cluster symbol representations based on their structural position and relation to other symbols.

## Appendix A: Relations between Hadamard- and tensor-product-binding

### A.1 General considerations

In the version of the TP-Transformer studied in this paper, binding of relations to their values is not done by the tensor product, $\otimes$, as in full TPRs. Rather, a contraction of the full TPR is used: the diagonal, which is the elementwise or Hadamard product $\odot$. (This is a vector, and should not be confused with the inner product, which is a scalar: the inner product is the sum of all the elements of the Hadamard product.) To what extent does Hadamard-product binding share relevant properties with tensor-product binding?

A crucial property of the tensor product for its use in vector representations of structure is that a structure like $a/b$ is not confusable with $b/a$, unlike the frequently-used bag-of-words (BOW) encoding: in the BOW encoding of $a/b$, the pair of arguments to the operator are encoded simply as $\mathbf{a} + \mathbf{b}$, where $\mathbf{a}$ and $\mathbf{b}$ are respectively the vector encodings of $a$ and $b$. Obviously, this cannot be distinguished from the BOW encoding of the argument pair in $b/a$, $\mathbf{b} + \mathbf{a}$. (Hence the name, symbol “bag”, as opposed to symbol “structure”.)

In a tensor-product representation of the argument pair in $a/b$, we have $\mathbf{a} \otimes \mathbf{r}_n + \mathbf{b} \otimes \mathbf{r}_d$, where $\mathbf{r}_n$ and $\mathbf{r}_d$ are respectively distinct vector embeddings of the numerator (or first-argument) and denominator (or second-argument) roles, and $\otimes$ is the tensor product. This is distinct from $\mathbf{a} \otimes \mathbf{r}_d + \mathbf{b} \otimes \mathbf{r}_n$, the embedding of the argument-pair in $b/a$. (In Sec. 6.2 of the paper, an aspect of this general property, in the context of attention models, is discussed. In Sec. 5, visualization of the roles and the per-role-attention show that this particular distinction, between the numerator and denominator roles, is learned and used by the trained TP-Transformer model.)

This crucial property of the tensor product, that $\mathbf{a} \otimes \mathbf{r}_n + \mathbf{b} \otimes \mathbf{r}_d \neq \mathbf{a} \otimes \mathbf{r}_d + \mathbf{b} \otimes \mathbf{r}_n$, is shared by the Hadamard product: if we now take $\otimes$ to represent the Hadamard product, the inequality remains true. To achieve this important property, the full tensor product is not required: the Hadamard product is the diagonal of the tensor product, which retains much of the product structure of the tensor product. In any application, it is an empirical question how much of the full tensor product is required to successfully encode distinctions between bindings of symbols to roles; in the TP-Transformer, it turns out that the diagonal of the tensor product is sufficient to get improvement in performance over having no symbol-role-product structure at all. Unfortunately, the compute requirements of training on the Mathematics Dataset currently make using the full tensor product infeasible, unless the vector representations of symbols and roles are reduced to dimensions that proved to be too small for the task. When future compute makes it possible, we expect that expanding from the diagonal to the full tensor product will provide further improvement in performance and interpretability.
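The three encodings can be compared directly on random vectors. The embeddings below are arbitrary stand-ins (the names `r_num` and `r_den` are chosen for the example), so the sketch only illustrates the generic case: the bag-of-words sums coincide, while both the tensor-product and the Hadamard-product bindings tell the two argument orders apart, and the Hadamard binding is the diagonal of the tensor-product binding.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=4), rng.normal(size=4)            # symbol embeddings
r_num, r_den = rng.normal(size=4), rng.normal(size=4)    # numerator and denominator roles

bow_ab, bow_ba = a + b, b + a                            # bag-of-words encodings
tpr_ab = np.outer(a, r_num) + np.outer(b, r_den)         # tensor-product binding of a/b
tpr_ba = np.outer(b, r_num) + np.outer(a, r_den)         # tensor-product binding of b/a
had_ab = a * r_num + b * r_den                           # Hadamard binding of a/b
had_ba = b * r_num + a * r_den                           # Hadamard binding of b/a

print(np.allclose(bow_ab, bow_ba),
      np.allclose(tpr_ab, tpr_ba),
      np.allclose(had_ab, had_ba))                       # True False False
print(np.allclose(np.diag(tpr_ab), had_ab))              # True
```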

We next move beyond these general considerations and consider a setting in which Hadamard-product attention can yield an optimal approximation to tensor-product attention.

### A.2 Hadamard-product attention as an optimal approximation to tensor-product attention

In Eq. 6 for TPMHA, we have a sum over all heads $h$ of an affine-transformed product of a value vector $v_h$ and a role vector $r_h$. (Throughout this discussion, we leave the subscripts $t, l$ implicit, as well as the over-bar on $v_h$ in Eq. 6.) In a hypothetical, full-TPR formulation, this product would be the tensor product $v_h \otimes r_h$, although in our actual proposed TP-Transformer, the Hadamard (elementwise) product $v_h \odot r_h$ (the diagonal of $v_h \otimes r_h$) is used. The appropriateness of the compression from tensor product to Hadamard product can be seen as follows.

In the hypothetical full-TPR version of TPMHA, attention would return the sum of $H$ tensor products. This tensor would have rank at most $H$, potentially enabling a substantial degree of compression across all tensors the model will compute over the data of interest. Given the translation-invariance built into the Transformer via position-invariant parameters, the same compression must be applied in all positions within a given layer $l$, although the compression may vary across heads. For the compression of this tensor we will need more than $H$ components, as this decomposition needs to be optimal over all tensors in that layer for all data points.

In detail, for each head $h$, the compression of the tensor $v_h \otimes r_h$ (or matrix $A_h = v_h r_h^{\top}$) is to dimension $d_k$, which will ultimately be mapped to dimension $d_z$ (to enable addition with $z_{t,l}$ via the residual connection of Eq. 1) by the affine transformation of Eq. 6. The optimal $d_k$-dimensional compression for head $h$ at layer $l$ would preserve the $d_k$ dominant dimensions of variance of the attention-generated states for that head and layer, across all positions and inputs: a kind of singular-value decomposition retaining those dimensions with the principal singular values. Denote these principal directions by $\{m_{hc} \otimes n_{hc} \mid c = 1, \dots, d_k\}$, and let $M_h$ and $N_h$ respectively be the matrices with the orthonormal vectors $\{m_{hc}\}$ and $\{n_{hc}\}$ as columns. (Note that orthonormality implies that $M_h^{\top} M_h = I$ and $N_h^{\top} N_h = I$, with $I$ the identity matrix.)

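The compression of $A_h$, $\hat{A}_h$, will lie within the space spanned by these tensor products $m_{hc} \otimes n_{hc}$, i.e., $\hat{A}_h = \sum_{c} \tilde{a}_{hc}\, m_{hc} \otimes n_{hc}$; in matrix form, $\hat{A}_h = M_h \tilde{A}_h N_h^{\top}$, where $\tilde{A}_h$ is the diagonal matrix with elements $\tilde{a}_{hc}$. Thus the $d_k$ dimensions $\{\tilde{a}_{hc}\}$ of the compressed matrix $\hat{A}_h$ that approximates $A_h$ are given by: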

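$$
\begin{aligned}
\tilde{A}_h &= M_h^{\top} \hat{A}_h N_h \approx M_h^{\top} A_h N_h = M_h^{\top} v_h r_h^{\top} N_h = (M_h^{\top} v_h)(N_h^{\top} r_h)^{\top} \\
\tilde{a}_{hc} &= [\tilde{A}_h]_{cc} \approx [M_h^{\top} v_h]_c\, [N_h^{\top} r_h]_c = [\tilde{v}_h \odot \tilde{r}_h]_c
\end{aligned}
$$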

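where $\tilde{v}_h = M_h^{\top} v_h$, $\tilde{r}_h = N_h^{\top} r_h$. Now from Eq. 3, $v_h = W^{(v)}_h z + b^{(v)}_h$, so $\tilde{v}_h = M_h^{\top}\big(W^{(v)}_h z + b^{(v)}_h\big)$. Thus by changing the parameters $W^{(v)}_h, b^{(v)}_h$ to $\tilde{W}^{(v)}_h = M_h^{\top} W^{(v)}_h$, $\tilde{b}^{(v)}_h = M_h^{\top} b^{(v)}_h$, and analogously for the role parameters $W^{(r)}_h, b^{(r)}_h$, we convert our original hypothetical TPR attention tensor $A_h$ to its optimal $d_k$-dimensional approximation, in which the tensor product of the original vectors $v_h, r_h$ is replaced by the Hadamard product of the linearly-transformed vectors $\tilde{v}_h, \tilde{r}_h$. Therefore, in the proposed model, which deploys the Hadamard product, learning simply needs to converge to the parameters $\tilde{W}^{(v)}_h, \tilde{b}^{(v)}_h, \tilde{W}^{(r)}_h, \tilde{b}^{(r)}_h$ rather than the parameters $W^{(v)}_h, b^{(v)}_h, W^{(r)}_h, b^{(r)}_h$.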
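The two algebraic steps of this argument can be checked numerically. The sketch uses random orthogonal matrices as stand-ins for the principal-direction matrices and random vectors for the value and role (all names, sizes and values are assumptions for the example). It verifies that the diagonal of the rotated tensor product equals the Hadamard product of the rotated vectors, and that the rotation can be absorbed into the affine parameters that produce the value vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_z = 6, 10
v, r = rng.normal(size=d_k), rng.normal(size=d_k)       # value and role vectors of one head
M, _ = np.linalg.qr(rng.normal(size=(d_k, d_k)))        # orthonormal columns
N, _ = np.linalg.qr(rng.normal(size=(d_k, d_k)))        # orthonormal columns

A = np.outer(v, r)                                      # full tensor-product binding
diag_rotated = np.diag(M.T @ A @ N)                     # diagonal entries after rotation
hadamard = (M.T @ v) * (N.T @ r)                        # Hadamard product of rotated vectors
print(np.allclose(diag_rotated, hadamard))              # True

# Absorbing the rotation into the affine map that produces the value vector.
W, bias, z = rng.normal(size=(d_k, d_z)), rng.normal(size=d_k), rng.normal(size=d_z)
rotated_then = M.T @ (W @ z + bias)                     # rotate the output of the original map
absorbed = (M.T @ W) @ z + M.T @ bias                   # use transformed parameters instead
print(np.allclose(rotated_then, absorbed))              # True
```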