1 Introduction
In this paper we propose a variation of the Transformer (Vaswani et al., 2017) that is designed to allow it to better incorporate structure into its representations. We test the proposal on a task where structured representations are expected to be particularly helpful: math word-problem solving, where, among other things, correctly parsing expressions and compositionally evaluating them is crucial. Given as input a free-form math question in the form of a character sequence like Let r(g) be the second derivative of 2*g**3/3 - 21*g**2/2 + 10*g. Let z be r(7). Factor -z*s + 6 - 9*s**2 + 0*s + 6*s**2., the model must produce an answer matching the specified target character sequence -(s + 3)*(3*s - 2) exactly. Our proposed model is trained end-to-end and infers the correct answer for novel examples without any task-specific structural biases.
We begin by viewing the Transformer as a kind of Graph Neural Network (e.g., Gori et al., 2005). For concreteness, consider the encoder component of a Transformer with $H$ heads. When head $h$ of a cell at position $t$ of layer $l$ issues a query and as a result concentrates its self-attention distribution on another cell at position $t'$ in layer $l$, we can view these two cells as joined by an edge in an information-flow graph: the information content at $t'$ in effect passes via this edge to affect the state of $t$. The strength of this attention can be viewed as a weight on this edge, and the index $h$ of the head can be viewed as a label. Thus, each layer of the Transformer can be viewed as a complete, directed, weighted, labeled graph. Prior NLP work has interpreted certain edges of these graphs in terms of linguistic relations (Sec. 7), and we wish to enrich the relation structure of these graphs to better support the explicit representation of relations within the Transformer.

Here we propose to replace each of the discrete edge labels $h \in \{1, \dots, H\}$ with a relation vector
$r^{(t,l,h)}$: we create a bona fide representational space for the relations being learned by the Transformer. This makes it possible for the hidden representation at each cell to approximate the vector embedding of a symbolic structure built from the relations generated by that cell. This embedding is a Tensor-Product Representation (TPR; Smolensky, 1990) in an end-to-end-differentiable TPR system (Schlag & Schmidhuber, 2018; Schmidhuber, 1992) that learns "internal spotlights of attention" (Schmidhuber, 1993). TPRs provide a general method for embedding symbol structures in vector spaces. TPRs support compositional processing by directly encoding constituent structure: the representation of a structure is the sum of the representations of its constituents. The representation of each constituent is built compositionally from two vectors: one vector that embeds the content of the constituent, the 'filler' (here, the vector returned by attention), and a second vector that embeds the structural role it fills (here, a relation conceptually labeling an edge of the attention graph). The vector that embeds a filler and the vector that embeds the role it fills are bound together by the tensor product to form the tensor that embeds the constituent they together define.^1 The relations here, and the structures they define, are learned unsupervised by the Transformer in service of a task; post-hoc analysis is then required to interpret those roles.

^1 The tensor product operation (when the role-embedding vectors are linearly independent) enables the sum of constituents representing the structure as a whole to be uniquely decomposed back into individual pairs of roles and their fillers, if necessary.

In the new model, the TP-Transformer, each head of each cell generates a key, value, and query vector, as in the Transformer, but additionally generates a role vector (which we refer to in some contexts as a 'relation vector'). The query is interpreted as seeking the appropriate filler for that role (or, equivalently, the appropriate string location for fulfilling that relation). Each head binds that filler to its role via the tensor product (or some contraction of it), and these filler/role bindings are summed to form the TPR of a structure with $H$ constituents (details in Sec. 2).
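The binding and unbinding behavior of TPRs can be illustrated numerically. Below is a minimal NumPy sketch with made-up 3-dimensional fillers and orthonormal role vectors (not the learned vectors of the model); because the roles are linearly independent, each filler can be recovered exactly from the summed structure:

```python
import numpy as np

# Hypothetical filler (content) vectors and orthonormal role vectors.
f1, f2 = np.array([1., 2., 0.]), np.array([0., 1., 3.])
r1, r2 = np.eye(3)[0], np.eye(3)[1]          # linearly independent roles

# Each constituent is the tensor (outer) product of a filler and a role;
# the structure is the sum of its constituents.
T = np.outer(f1, r1) + np.outer(f2, r2)

# With orthonormal roles, unbinding a role (multiplying the structure by
# the role vector) recovers that role's filler exactly.
assert np.allclose(T @ r1, f1)
assert np.allclose(T @ r2, f2)
```

This uniqueness of decomposition is the property the footnote above refers to; the TP-Transformer uses a contraction of this binding rather than the full outer product (Sec. 2.1).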
An interpretation of an actual learned relation illustrates this (see Fig. 4 in Sec. 5.2). One head of our trained model can be interpreted as partially encoding the relation second-argument-of. The top-layer cell dominating an input digit seeks the operator of which the digit is in the second-argument role. That cell generates a role vector $r$ signifying this relation, and retrieves a value vector $v$ describing the operator at the position that stands in this relation. The result of this head's attention is then the binding of filler $v$ to role $r$; this binding is added to the bindings resulting from the cell's other attention heads.
On the Mathematics Dataset (Sec. 3), the new model sets a new state of the art for the overall accuracy (Sec. 4) and for all the individual-problem-type module accuracies (Fig. 2). Initial results of interpreting the learned roles for the arithmetic-problem module show that they include a good approximation to the second-argument role of the division operator and that they distinguish between numbers in the numerator and denominator roles (Sec. 5).

More generally, we show that Multi-Head Attention layers not only capture a subspace of the attended cell but capture nearly its full information content (Sec. 6.1). We argue that multiple layers of standard attention suffer from the binding problem, and we show theoretically how the proposed TP-Attention avoids such ambiguity (Sec. 6.2). The paper closes with a discussion of related work (Sec. 7) and a conclusion (Sec. 8).
2 The TP-Transformer
The TP-Transformer's encoder network, like the Transformer's encoder (Vaswani et al., 2017), can be described as a 2-dimensional lattice of cells $(t, l)$ where $t = 1, \dots, T$ are the sequence elements of the input and $l = 1, \dots, L$ are the layer indices, with $l = 0$ the embedding layer. All cells share the same topology and the cells of the same layer share the same weights. More specifically, each cell consists of an initial layer normalization (LN) followed by a TP-Multi-Head Attention (TPMHA) sublayer followed by a fully-connected feed-forward (FF) sublayer. Each sublayer is followed by layer normalization (LN) and by a residual connection (as in the original Transformer; Eq. 1). Our cell structure follows directly from the official TensorFlow source code by Vaswani et al. (2017), but with regular Multi-Head Attention replaced by our TPMHA layer. The input to cell $(t, l)$ is the output of cell $(t, l-1)$ and doesn't depend on the state of any other cell of the same layer, which allows a layer's outputs to be computed in parallel.
$$
\begin{aligned}
\bar z^{(t,l)} &= \mathrm{LN}\big(z^{(t,l)}\big) \\
a^{(t,l)} &= \mathrm{LN}\Big(\mathrm{TPMHA}\big(\bar z^{(t,l)}, \bar z^{(1,l)}, \dots, \bar z^{(T,l)}\big)\Big) + z^{(t,l)} \\
z^{(t,l+1)} &= \mathrm{LN}\big(\mathrm{FF}(a^{(t,l)})\big) + a^{(t,l)}
\end{aligned}
\tag{1}
$$
We represent the symbols of the input string as one-hot vectors $x^{(t)} \in \{0,1\}^{n_{\mathrm{sym}}}$, where $n_{\mathrm{sym}}$ is the size of the vocabulary and the respective columns of the embedding matrix $E \in \mathbb{R}^{d_z \times n_{\mathrm{sym}}}$ are the embedding vectors of those symbols. We also include a positional representation $p^{(t)} \in \mathbb{R}^{d_z}$ using the same sinusoidal encoding scheme introduced by Vaswani et al. (2017). The input of the first layer is $z^{(t,1)}$:
$$z^{(t,1)} = \big(E x^{(t)} + p^{(t)}\big) \odot r^{(t)} \tag{2}$$
where $z^{(t,1)} \in \mathbb{R}^{d_z}$, $r^{(t)} \in \mathbb{R}^{d_z}$ is a position- and symbol-dependent role representation, and $\odot$ is elementwise multiplication (a contraction of the tensor product: see Sec. 2.1).
2.1 TP-Multi-Head Attention
The TPMHA layer of the encoder consists of $H$ heads that can be applied in parallel. Every head $h \in \{1, \dots, H\}$ applies separate affine transformations $(W_k, b_k), (W_v, b_v), (W_q, b_q), (W_r, b_r)$ to produce key, value, query, and relation vectors from the hidden state $\bar z^{(t,l)}$, where $d_k = d_v = d_z / H$:
$$
\begin{aligned}
k^{(t,l,h)} &= W_k^{(l,h)} \bar z^{(t,l)} + b_k^{(l,h)}, &\quad v^{(t,l,h)} &= W_v^{(l,h)} \bar z^{(t,l)} + b_v^{(l,h)}, \\
q^{(t,l,h)} &= W_q^{(l,h)} \bar z^{(t,l)} + b_q^{(l,h)}, &\quad r^{(t,l,h)} &= W_r^{(l,h)} \bar z^{(t,l)} + b_r^{(l,h)}
\end{aligned}
\tag{3}
$$
The filler $\bar v^{(t,l,h)}$ of attention head $h$ is
$$\bar v^{(t,l,h)} = \sum_{i=1}^{T} \alpha_i^{(t,l,h)} \, v^{(i,l,h)} \tag{4}$$
i.e., a weighted sum of all values of the same layer and attention head (see Fig. 1). Here $\alpha_i^{(t,l,h)}$ is a continuous degree of match given by the softmax of the scaled dot product between the query vector at position $t$ and the key vector at position $i$:
$$\alpha_i^{(t,l,h)} = \operatorname*{softmax}_i \left( \frac{q^{(t,l,h)} \cdot k^{(i,l,h)}}{\sqrt{d_k}} \right) \tag{5}$$
The scale factor $1/\sqrt{d_k}$ can be motivated as a variance-reducing factor under the assumption that the elements of $q^{(t,l,h)}$ and $k^{(i,l,h)}$ are uncorrelated variables with mean 0 and variance 1; it initially keeps the inputs of the softmax in a region with better gradients.

Finally, we bind the filler $\bar v^{(t,l,h)}$ with our relation vector $r^{(t,l,h)}$, followed by an affine transformation $(W_o^{(l,h)}, b_o^{(l,h)})$, before it is summed with the other heads' bindings to form the TPR of a structure with $H$ constituents; this is the output of the TPMHA layer:
$$\mathrm{TPMHA}\big(\bar z^{(t,l)}, \bar z^{(1,l)}, \dots, \bar z^{(T,l)}\big) = \sum_{h=1}^{H} W_o^{(l,h)} \big( \bar v^{(t,l,h)} \odot r^{(t,l,h)} \big) + b_o^{(l,h)} \tag{6}$$
Note that, in this binding, to control dimensionality, we use a contraction of the tensor product, pointwise multiplication $\odot$: this is the diagonal of the tensor product. For discussion, see the Appendix.
It is worth noting that the TPMHA layer returns a vector that is quadratic in the inputs to the layer: the value vectors $v^{(i,l,h)}$ that are linearly combined to form $\bar v^{(t,l,h)}$ (Eq. 4) and the relation vector $r^{(t,l,h)}$ are both linear in the hidden states (Eq. 3), and they are multiplied together to form the output of TPMHA (Eq. 6). This means that, unlike regular attention, TPMHA can increase, over successive layers, the polynomial degree of its representations as a function of the original input to the Transformer. Although it is true that the feed-forward layer following attention (Sec. 2.2) introduces its own nonlinearity even in the standard Transformer, in the TP-Transformer the attention mechanism itself goes beyond mere linear recombination of vectors from the previous layer. This provides further potential for the construction of increasingly abstract representations in higher layers.
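The computation of Eqs. 3-6 can be sketched compactly for a single head. The following NumPy sketch uses made-up dimensions and random weights, and omits the bias terms for brevity; it illustrates the computation, not the trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tp_attention_head(Z, Wk, Wv, Wq, Wr, Wo):
    """One TP-Attention head (Eqs. 3-6), biases omitted for brevity.

    Z: (T, d_z) hidden states; Wk/Wv/Wq/Wr: (d_k, d_z); Wo: (d_z, d_k).
    """
    K, V, Q, R = Z @ Wk.T, Z @ Wv.T, Z @ Wq.T, Z @ Wr.T   # Eq. 3
    d_k = K.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)       # Eq. 5
    V_bar = alpha @ V                                      # filler, Eq. 4
    return (V_bar * R) @ Wo.T                              # Hadamard binding + map, Eq. 6

# Toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
T, d_z, d_k = 5, 8, 4
Z = rng.standard_normal((T, d_z))
Wk, Wv, Wq, Wr = (rng.standard_normal((d_k, d_z)) for _ in range(4))
Wo = rng.standard_normal((d_z, d_k))
out = tp_attention_head(Z, Wk, Wv, Wq, Wr, Wo)
assert out.shape == (T, d_z)
```

In the full layer, one such binding is computed per head and the $H$ per-head outputs are summed.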
2.2 Feedforward Layer
The feedforward layer of a cell consists of an affine transformation followed by a ReLU activation and a second affine transformation:
$$\mathrm{FF}(x) = W_2^{(l)} \max\!\big(0,\; W_1^{(l)} x + b_1^{(l)}\big) + b_2^{(l)} \tag{7}$$
Here, $W_1^{(l)} \in \mathbb{R}^{d_{\mathrm{ff}} \times d_z}$, $W_2^{(l)} \in \mathbb{R}^{d_z \times d_{\mathrm{ff}}}$, and $x$ is the function's argument. As in previous work, we set $d_{\mathrm{ff}} = 4\, d_z$.
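Eq. 7 is a standard two-layer MLP; a minimal sketch with assumed toy dimensions ($d_{\mathrm{ff}} = 4 d_z$) and random weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FF sublayer (Eq. 7): affine -> ReLU -> affine."""
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

d_z, d_ff = 8, 32                      # d_ff = 4 * d_z, as stated above
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_ff, d_z)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_z, d_ff)), np.zeros(d_z)
y = feed_forward(rng.standard_normal(d_z), W1, b1, W2, b2)
assert y.shape == (d_z,)
```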
2.3 The Decoder Network
The decoder network is a separate network with a structure similar to the encoder's that takes the hidden states of the encoder and autoregressively generates the output sequence. In contrast to the encoder network, the cells of the decoder contain two TPMHA layers and one feed-forward layer. We designed our decoder network analogously to Vaswani et al. (2017): the first attention layer attends over the masked decoder states while the second attention layer attends over the final encoder states. During training, the decoder network receives the shifted targets (teacher forcing) while during inference we use the previous symbol with highest probability (greedy decoding). The final symbol probability distribution is given by
$$\hat y^{(t)} = \operatorname{softmax}\big( E^\top s^{(t,L)} \big) \tag{8}$$
where $s^{(t,L)}$ is the hidden state of the last layer $L$ of the decoder at decoding step $t$ of the output sequence and $E$ is the shared symbol embedding of the encoder and decoder.
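The greedy-decoding loop can be sketched as follows; `step_fn` is a hypothetical stand-in for the decoder stack plus the softmax of Eq. 8:

```python
import numpy as np

def greedy_decode(step_fn, sos_id, eos_id, max_len=10):
    """Greedy decoding sketch: feed back the argmax symbol each step."""
    prefix = [sos_id]
    while len(prefix) < max_len:
        probs = step_fn(prefix)            # distribution over the vocabulary
        nxt = int(np.argmax(probs))
        prefix.append(nxt)
        if nxt == eos_id:
            break
    return prefix

# Dummy stand-in decoder that always prefers symbol 2, then EOS (id 3).
dummy = lambda prefix: np.eye(4)[2] if len(prefix) < 3 else np.eye(4)[3]
assert greedy_decode(dummy, sos_id=0, eos_id=3) == [0, 2, 2, 3]
```

During training, teacher forcing replaces this loop: the shifted target sequence is fed in directly.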
3 The Mathematics Dataset
The Mathematics Dataset (Saxton et al., 2019) is a large collection of math problems of various types, including algebra, arithmetic, calculus, numerical comparison, measurement, numerical factorization, and probability. Its main goal is to investigate the capability of neural networks to reason formally. Each problem is structured as a character-level sequence-to-sequence problem. The input sequence is a free-form math question or command like What is the first derivative of 13*a**2 - 627434*a + 11914106? from which our model correctly predicts the target sequence 26*a - 627434. Another example from a different module is Calculate 66.6*12.14., which has 808.524 as its target sequence.
The dataset is structured into 56 modules which cover a broad spectrum of mathematics up to university level. It is procedurally generated and comes with 2 million pregenerated training samples per module. The authors provide an interpolation dataset for every module, as well as a few extrapolation datasets as an additional measure of algebraic generalization.
We merge the different training splits train-easy, train-medium, and train-hard from all modules into one big training dataset of 120 million unique samples. From this dataset we extract a character-level vocabulary of 72 symbols, including start-of-sentence, end-of-sentence, and padding symbols.^2

^2 Note that Saxton et al. (2019) report a vocabulary size of 95, but this figure encompasses characters that never appear in the pregenerated training and test data.
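The vocabulary-extraction step can be sketched as follows; the sample strings and the special-symbol names (`<pad>`, `<sos>`, `<eos>`) are stand-ins, so this toy vocabulary is much smaller than the 72-symbol one extracted from the full dataset:

```python
# Sketch of character-level vocabulary construction; these two strings
# stand in for the 120M training samples.
samples = [
    "What is the first derivative of 13*a**2 - 627434*a + 11914106?",
    "Calculate 66.6*12.14.",
]
special = ["<pad>", "<sos>", "<eos>"]          # hypothetical symbol names
chars = sorted({c for s in samples for c in s})
vocab = special + chars
char_to_id = {c: i for i, c in enumerate(vocab)}
assert char_to_id["<pad>"] == 0 and "*" in char_to_id
```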
4 Experimental results
We evaluate our trained model on the concatenated interpolation and extrapolation datasets of the pregenerated files, achieving a new state of the art: see Table 1. For a more detailed comparison, Fig. 2 shows the interpolation and extrapolation performance of every module separately. The TP-Transformer matches or outperforms the Transformer on every module but one (probability__swr_p_sequence). Our model never quite converged and was stopped prematurely after 1.7 million steps. We trained it on one server with 4 Nvidia V100 GPUs for 25 days.
| Model | weights | steps | train acc | interpolation acc | interp. >95% | extrapolation acc | extrap. >95% |
| Simple LSTM | 18M | 500k | — | 57.00% | 6 | 41.00% | 1 |
| Transformer (Saxton et al.) | 30M | 500k | — | 76.00% | 13 | 50.00% | 1 |
| Transformer (ours) | 44.2M | | | | | | |
| TP-Transformer (ours) | 49.2M | 1.7M | | 84.24% | | | |
4.1 Implementation Details
We initialize the symbol embedding matrix $E$ and the input role representations separately, and all other matrices using the Xavier uniform initialization as introduced by Glorot & Bengio (2010). We were not able to train the TP-Transformer, nor the regular Transformer, using the learning rate and gradient clipping scheme described by Saxton et al. (2019). Instead we proceed as follows: the gradients are computed using PyTorch's Autograd engine and their norm is clipped at 0.1. The optimizer is likewise Adam, but with a smaller learning rate. We train with a batch size of 1024 for up to 1.7 million steps.

5 Interpreting the learned structure
We report initial results of analyzing the learned structure of the encoder network's last layer from our 700k-step TP-Transformer.
5.1 Interpreting the learned roles
To this end, we sample 128 problems from the interpolation dataset of the arithmetic__mixed module and collect the role vectors from a randomly chosen head. We use k-means to cluster the role vectors from different samples and different time steps of the final layer of the encoder. Interestingly, we find separate clusters for digits in the numerator and denominator of fractions. When there is a fraction of fractions, we observe that these assignments are reversed for the second fraction, arguably simplifying the division of fractions into a multiplication of fractions (see Fig. 3).
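The clustering step can be sketched with a minimal k-means on synthetic stand-in vectors; this does not reproduce the exact clustering configuration of our analysis:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal k-means (farthest-first init + Lloyd updates); a sketch of
    clustering role vectors, not the paper's exact setup."""
    centers = [X[0]]
    for _ in range(k - 1):  # farthest-first initialization
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):  # Lloyd iterations
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

# Two synthetic, well-separated "role vector" groups stand in for the
# r-vectors collected from the trained model; k-means recovers the groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
assign = kmeans(X, k=2)
assert set(assign[:20]) != set(assign[20:])
```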
5.2 Interpreting the attention maps
In Fig. 4 we display three separate attention-weight vectors of one head of the last TP-Transformer layer of the encoder. Gold boxes are overlaid to highlight the most relevant portions. The row above the attention mask indicates the symbols that give information to the symbol in the bottom row; in each case, they give information to '/'. Seen most simply in the first example, this attention can be interpreted as encoding a relation second-argument-of holding between the attended digits and the '/' operator. The second and third examples show that several numerals in the denominator can participate in this relation. The third display shows how a numerator numeral (297) intervening between two denominator numerals is skipped over for this relation.
6 Insights and deficits of multiple Multi-Head Attention layers
6.1 Multi-Head Attention subspaces capture virtually all information
It was claimed by Vaswani et al. (2017) that multi-head attention "allows the model to jointly attend to information from different representation subspaces at different positions." In this section, we show that in our trained models, an individual attention head does not access merely a subset of the information in the attended cell but instead captures nearly its full information content.
Let us consider a toy example where the attention layer of cell $t$ only attends to cell $t'$. In this setting, the post-attention representation simplifies and becomes
$$\hat z^{(t)} = z^{(t)} + W_o \big( W_v z^{(t')} + b_v \big) + b_o \tag{9}$$
where $(W_v, b_v)$ and $(W_o, b_o)$ are the respective affine maps (see Sec. 2.1). Note that even though $W_v$ is a projection into an 8-times-smaller vector space, it does not follow that the hidden state loses information about $z^{(t')}$. We empirically test to what extent the trained Transformer and TP-Transformer lose information. To this end, we randomly select samples and extract the hidden states $z^{(i,L)}$ of the last layer $L$ of the encoder, as well as the value representations $v^{(i,L,h)}$ for every head $h$. We then train an affine model to reconstruct $z^{(i,L)}$ from $v^{(i,L,h)}$, the value vector of the single head $h$:
$$\hat z^{(i,L)} = A\, v^{(i,L,h)} + a, \qquad \min_{A,\,a}\; \big\| \hat z^{(i,L)} - z^{(i,L)} \big\|^2 \tag{10}$$
For both trained models, the TP-Transformer and the regular Transformer, the mean squared reconstruction error averaged across all heads is only ~0.017 and ~0.009, respectively. This indicates that the attention mechanism incorporates not just a subspace of the states it attends to but affine transformations of those states that preserve nearly the full information content.
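The probe of Eq. 10 amounts to an affine least-squares fit. A synthetic sketch of the procedure follows; here the value dimension is deliberately chosen larger than the hidden dimension, so exact recovery is possible by construction, whereas in the trained models $d_v$ is 8 times smaller and the low error is an empirical finding:

```python
import numpy as np

# Synthetic stand-in for the probe of Sec. 6.1: recover the full hidden
# state z from one head's value vector v = W_v z + b_v.
rng = np.random.default_rng(0)
d_z, d_v, n = 16, 32, 500          # d_v >= d_z, so recovery can be exact here
Z = rng.standard_normal((n, d_z))
Wv, bv = rng.standard_normal((d_v, d_z)), rng.standard_normal(d_v)
V = Z @ Wv.T + bv

# Fit the affine reconstruction z_hat = A v + a by least squares,
# absorbing the bias into an appended constant column.
V1 = np.hstack([V, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(V1, Z, rcond=None)
mse = ((V1 @ coef - Z) ** 2).mean()
assert mse < 1e-8
```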
6.2 The Binding Problem of stacked Attention layers
The binding problem refers to the problem of binding features together into objects while keeping them separated from other objects. It has been studied in the context of theoretical neuroscience (von der Malsburg, 1981, 1994) but also with regard to connectionist machine learning models (Hinton et al., 1984). The purpose of a binding mechanism is to enable the fully distributed representation of symbolic structure (like a hierarchy of features), which has recently resurfaced as an important direction for neural network research (Lake & Baroni, 2017; Bahdanau et al., 2018; van Steenkiste et al., 2019; Palangi et al., 2017; Tang et al., 2018).

In this section, we describe how the standard attention mechanism is ill-suited to capture complex nested representations, and we provide an intuitive understanding of the benefit of our TP-Attention. We understand the attention layer of a cell as the means by which the subject (the cell state) queries all other cells for an object. We then show how a hierarchical representation of multiple queries becomes ambiguous across multiple standard attention layers.
Consider the string (a/b)/(c/d). A good neural representation captures the hierarchical structure of the string such that it will not be confused with the similar-looking but structurally different string (a/d)/(c/b). Our TP-Attention makes use of a binding mechanism in order to explicitly support complex structural relations, binding together the object representations receiving high attention with a subject-specific role representation. Let us continue with a more technical example. Consider a simplified Transformer network where every cell consists only of a single-head attention layer with a residual connection: no feed-forward layer or layer normalization, and let us assume no bias terms in the maps $W_v$ and $W_o$ introduced in the previous section (Eq. 9). In this setting, assume that $z^{(t,l)}$ only attends to $z^{(t',l)}$, and $z^{(s,l)}$ only attends to $z^{(t'',l)}$, where $t, s, t', t''$ are distinct positions of the input sequence. In this case

$$
z^{(t,l+1)} = z^{(t,l)} + W_o W_v z^{(t',l)}, \qquad
z^{(s,l+1)} = z^{(s,l)} + W_o W_v z^{(t'',l)}
\tag{11}
$$
Suppose now that, for hierarchical grouping, a cell $u$ of the next layer attends to both $z^{(t,l+1)}$ and $z^{(s,l+1)}$ (equally, each with attention weight $\tfrac{1}{2}$). With the next layer's maps $W'_o, W'_v$, this results in the representation
$$
\begin{aligned}
z^{(u,l+2)} &= z^{(u,l+1)} + \tfrac{1}{2} W'_o W'_v \big( z^{(t,l+1)} + z^{(s,l+1)} \big) \\
&= z^{(u,l+1)} + \tfrac{1}{2} W'_o W'_v \big( z^{(t,l)} + z^{(s,l)} + W_o W_v ( z^{(t',l)} + z^{(t'',l)} ) \big)
\end{aligned}
\tag{12}
$$
Note that the final representation is ambiguous in the sense that it is unclear, looking only at Eq. 12, whether $z^{(t,l)}$ has picked $z^{(t',l)}$ or $z^{(t'',l)}$. Either scenario would have led to the same outcome, which means that the network would not be able to distinguish between these two different structures (as in confusing (a/b)/(c/d) with (a/d)/(c/b)). In order to resolve this ambiguity, the standard Transformer must recruit other attention heads or find suitable nonlinear maps in between attention layers, but it remains uncertain how the network might achieve a clean separation.
Our TP-Attention mechanism, on the other hand, specifically removes this ambiguity. Now Eqs. 11 and 12 become:

$$
\begin{aligned}
z^{(t,l+1)} &= z^{(t,l)} + W_o \big( W_v z^{(t',l)} \odot r^{(t,l)} \big), \qquad
z^{(s,l+1)} = z^{(s,l)} + W_o \big( W_v z^{(t'',l)} \odot r^{(s,l)} \big) \\
z^{(u,l+2)} &= z^{(u,l+1)} + \tfrac{1}{2} W'_o \Big( W'_v \big( z^{(t,l+1)} + z^{(s,l+1)} \big) \odot r^{(u,l+1)} \Big)
\end{aligned}
\tag{13}
$$
Note that the final representation is not ambiguous anymore. Binding the filler symbols (our objects) with a subjectspecific role representation as described in Eq. 6 breaks the structural symmetry we had with regular attention. It is now simple for the network to specifically distinguish the two different structures.
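The symmetry argument can be checked numerically. In the sketch below (random stand-in vectors and maps, biases omitted), the standard-attention sum of Eq. 12 is invariant under swapping the two attended cells, while the role-bound sum of Eq. 13 is not:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
z_t, z_s, z_tp, z_tpp = rng.standard_normal((4, d))  # states at t, s, t', t''
Wv, Wo = rng.standard_normal((2, d, d))              # shared value/output maps

# Standard attention (Eqs. 11-12): summing the two updated cells gives the
# same vector whether t attended to t' and s to t'', or the other way
# around; the pairing information is lost.
std = lambda z, att: z + Wo @ (Wv @ att)
pairing_1 = std(z_t, z_tp) + std(z_s, z_tpp)
pairing_2 = std(z_t, z_tpp) + std(z_s, z_tp)
assert np.allclose(pairing_1, pairing_2)

# TP-Attention (Eq. 13): binding each filler to a cell-specific role
# vector before the output map breaks this symmetry.
r_t, r_s = rng.standard_normal((2, d))
tp = lambda z, att, r: z + Wo @ ((Wv @ att) * r)
tp_1 = tp(z_t, z_tp, r_t) + tp(z_s, z_tpp, r_s)
tp_2 = tp(z_t, z_tpp, r_t) + tp(z_s, z_tp, r_s)
assert not np.allclose(tp_1, tp_2)
```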
7 Related work
Several recent studies have shown that the Transformer-based language model BERT (Devlin et al., 2018) captures linguistic relations such as those expressed in dependency-parse trees. This was shown for BERT's hidden activation states (Hewitt & Manning, 2019; Tenney et al., 2019) and, most directly related to the present work, for the graph implicit in BERT's attention weights (Coenen et al., 2019; Lin et al., 2019). Future work applying the TP-Transformer to language tasks (like those on which BERT is trained) will enable us to study the connection between the explicit relations the TP-Transformer learns and the implicit relations that have been extracted from BERT.
8 Conclusion
We have introduced the TP-Transformer, which enables the powerful Transformer architecture to learn to explicitly encode structural relations using Tensor-Product Representations. On the novel and challenging Mathematics Dataset, the TP-Transformer beats the previously published state of the art by 8.24%. Our initial analysis of this model's final layer suggests that the TP-Transformer naturally learns to cluster symbol representations based on their structural position and relation to other symbols.
References
 Bahdanau et al. (2018) Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic generalization: What is required and can it be learned? arXiv preprint arXiv:1811.12889, 2018.
 Coenen et al. (2019) Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, and Martin Wattenberg. Visualizing and measuring the geometry of BERT. arXiv preprint arXiv:1906.02715, 2019.
 Devlin et al. (2018) Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

 Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
 Gori et al. (2005) Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the IEEE International Joint Conference on Neural Networks, volume 2, pp. 729–734. IEEE, 2005.
 Hewitt & Manning (2019) John Hewitt and Christopher D Manning. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129–4138, 2019.
 Hinton et al. (1984) Geoffrey E Hinton, James L McClelland, David E Rumelhart, et al. Distributed representations. Carnegie-Mellon University, Pittsburgh, PA, 1984.
 Lake & Baroni (2017) Brenden M Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequencetosequence recurrent networks. arXiv preprint arXiv:1711.00350, 2017.
 Lin et al. (2019) Yongjie Lin, Yi Chern Tan, and Robert Frank. Open sesame: Getting inside BERT’s linguistic knowledge. arXiv preprint arXiv:1906.01698, 2019.
 Palangi et al. (2017) Hamid Palangi, Paul Smolensky, Xiaodong He, and Li Deng. Deep learning of grammaticallyinterpretable representations through questionanswering. arXiv preprint arXiv:1705.08432, 2017.
 Saxton et al. (2019) David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gR5iR5FX.
 Schlag & Schmidhuber (2018) Imanol Schlag and Jürgen Schmidhuber. Learning to reason with third order tensor products. In Advances in Neural Information Processing Systems (NeurIPS), pp. 9981–9993, 2018.
 Schmidhuber (1993) J. Schmidhuber. On decreasing the ratio between learning complexity and number of timevarying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pp. 460–463. Springer, 1993.
 Schmidhuber (1992) Jürgen Schmidhuber. Learning to control fastweight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
 Smolensky (1990) P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif. Intell., 46(1–2):159–216, November 1990. ISSN 0004-3702. doi: 10.1016/0004-3702(90)90007-M. URL http://dx.doi.org/10.1016/0004-3702(90)90007-M.
 Tang et al. (2018) Shuai Tang, Paul Smolensky, and Virginia R de Sa. Learning distributed representations of symbolic structure using binding and unbinding operations. arXiv preprint arXiv:1810.12456, 2018.
 Tenney et al. (2019) Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950, 2019.
 van Steenkiste et al. (2019) Sjoerd van Steenkiste, Klaus Greff, and Jürgen Schmidhuber. A perspective on objects and systematic generalization in modelbased rl. arXiv preprint arXiv:1906.01035, 2019.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
 von der Malsburg (1981) Christoph von der Malsburg. The correlation theory of brain function (internal report 81-2). Goettingen: Department of Neurobiology, Max Planck Institute for Biophysical Chemistry, 1981.
 von der Malsburg (1994) Christoph von der Malsburg. The correlation theory of brain function. In E. Domany, J. L. van Hemmen, and K. Schulten (eds.), Models of neural networks II, pp. 95–119. Springer, Berlin, 1994.
Appendix A: Relations between Hadamard- and tensor-product binding
A.1 General considerations
In the version of the TP-Transformer studied in this paper, binding of relations to their values is not done by the tensor product, $v \otimes r$, as in full TPRs. Rather, a contraction of the full TPR is used: the diagonal, which is the elementwise or Hadamard product $v \odot r$.^3 To what extent does Hadamard-product binding share relevant properties with tensor-product binding?

^3 This is a vector, and should not be confused with the inner product, which is a scalar: the inner product is the sum of all the elements of the Hadamard product.
A crucial property of the tensor product for its use in vector representations of structure is that a structure like b/d is not confusable with d/b, unlike under the frequently used bag-of-words (BOW) encoding: in the BOW encoding of b/d, the pair of arguments to the division operator is encoded simply as $b + d$, where $b$ and $d$ are respectively the vector encodings of the symbols b and d. Obviously, this cannot be distinguished from the BOW encoding of the argument pair in d/b, $d + b$. (Hence the name: symbol "bag", as opposed to symbol "structure".)
In a tensor-product representation of the argument pair in b/d, we have $b \otimes r_{\mathrm{num}} + d \otimes r_{\mathrm{denom}}$, where $r_{\mathrm{num}}$ and $r_{\mathrm{denom}}$ are respectively distinct vector embeddings of the numerator (or first-argument) and denominator (or second-argument) roles, and $\otimes$ is the tensor product. This is distinct from $d \otimes r_{\mathrm{num}} + b \otimes r_{\mathrm{denom}}$, the embedding of the argument pair in d/b. (In Sec. 6.2 of the paper, an aspect of this general property, in the context of attention models, is discussed. In Sec. 5, visualization of the roles and the per-role attention shows that this particular distinction, between the numerator and denominator roles, is learned and used by the trained TP-Transformer model.)

This crucial property of the tensor product, that $b \otimes r_{\mathrm{num}} + d \otimes r_{\mathrm{denom}} \neq d \otimes r_{\mathrm{num}} + b \otimes r_{\mathrm{denom}}$, is shared by the Hadamard product: if we now take $\otimes$ to represent the Hadamard product, the inequality remains true. To achieve this important property, the full tensor product is not required: the Hadamard product is the diagonal of the tensor product, which retains much of its product structure. In any application, it is an empirical question how much of the full tensor product is required to successfully encode distinctions between bindings of symbols to roles; in the TP-Transformer, it turns out that the diagonal of the tensor product suffices to improve performance over having no symbol-role product structure at all. Unfortunately, the compute requirements of training on the Mathematics Dataset currently make using the full tensor product infeasible, unless the vector representations of symbols and roles are reduced to dimensions that proved to be too small for the task. When future compute makes it possible, we expect that expanding from the diagonal to the full tensor product will provide further improvement in performance and interpretability.
We next move beyond these general considerations and consider a setting in which Hadamardproduct attention can yield an optimal approximation to tensorproduct attention.
A.2 Hadamard-product attention as an optimal approximation to tensor-product attention
In Eq. 6 for TPMHA, we have a sum over all $H$ heads of an affine-transformed product of a value vector $v$ and a role vector $r$. (Throughout this discussion, we leave the $t, l, h$ sub- and superscripts implicit, as well as the overbar on $v$ in Eq. 6.) In a hypothetical, full-TPR formulation, this product would be the tensor product $v \otimes r$, although in our actual proposed TP-Transformer, the Hadamard (elementwise) product $v \odot r$ (the diagonal of $v \otimes r$) is used. The appropriateness of the compression from tensor product to Hadamard product can be seen as follows.

In the hypothetical full-TPR version of TPMHA, attention would return the sum of $H$ tensor products. This tensor would have rank at most $H$, potentially enabling a substantial degree of compression across all tensors the model will compute over the data of interest. Given the translation invariance built into the Transformer via position-invariant parameters, the same compression must be applied at all positions $t$ within a given layer $l$, although the compression may vary across heads $h$. For the compression of $v \otimes r$ we will need more than $H$ components, as this decomposition needs to be optimal over all tensors in that layer for all data points.
In detail, for each head $h$, the compression of the tensor $v \otimes r$ (or matrix $v r^\top$) is to dimension $d_v$, which will ultimately be mapped to dimension $d_z$ (to enable addition with $z$ via the residual connection of Eq. 1) by the affine transformation $(W_o, b_o)$ of Eq. 6. The optimal $d_v$-dimensional compression for head $h$ at layer $l$ would preserve the $d_v$ dominant dimensions of variance of the attention-generated states for that head and layer, across all positions and inputs: a kind of singular-value decomposition retaining those dimensions with the $d_v$ principal singular values. Denote these principal directions by $u_i \otimes w_i$, $i = 1, \dots, d_v$, and let $U$ and $W$ respectively be the matrices with the orthonormal vectors $u_i$ and $w_i$ as columns. (Note that orthonormality implies that $U^\top U = I$ and $W^\top W = I$, with $I$ the identity matrix.)

The compression of $v \otimes r$, $v r^\top$, will lie within the space spanned by these tensor products $u_i \otimes w_i$, i.e., $\sum_i c_i \, u_i \otimes w_i$; in matrix form, $U C W^\top$, where $C$ is the diagonal matrix with elements $c_i$. Thus the $d_v$ diagonal elements of the compressed matrix that approximates $v r^\top$ are given by

$$c = \operatorname{diag}\!\big( U^\top (v r^\top) W \big) = (U^\top v) \odot (W^\top r) = \tilde v \odot \tilde r$$

where $\tilde v = U^\top v$, $\tilde r = W^\top r$. Now from Eq. 3, $v = W_v \bar z + b_v$, so $\tilde v = (U^\top W_v) \bar z + U^\top b_v$. Thus by changing the parameters $(W_v, b_v)$ to $(U^\top W_v, U^\top b_v)$, and analogously changing the role parameters $(W_r, b_r)$ to $(W^\top W_r, W^\top b_r)$, we convert our original hypothetical TPR attention tensor to its optimal $d_v$-dimensional approximation, in which the tensor product $v \otimes r$ of the original vectors is replaced by the Hadamard product $\tilde v \odot \tilde r$ of the linearly transformed vectors. Therefore, in the proposed model, which deploys the Hadamard product, learning simply needs to converge to the parameters $(U^\top W_v, U^\top b_v, W^\top W_r, W^\top b_r)$ rather than the parameters $(W_v, b_v, W_r, b_r)$.