Tensor Product Generation Networks for Deep NLP Modeling

09/26/2017 ∙ by Qiuyuan Huang, et al. ∙ University of Florida ∙ IEEE

We present a new approach to the design of deep networks for natural language processing (NLP), based on the general technique of Tensor Product Representations (TPRs) for encoding and processing symbol structures in distributed neural networks. A network architecture, the Tensor Product Generation Network (TPGN), is proposed which is capable in principle of carrying out TPR computation, but which uses unconstrained deep learning to design its internal representations. Instantiated in a model for image-caption generation, TPGN outperforms LSTM baselines when evaluated on the COCO dataset. The TPR-capable structure enables interpretation of internal representations and operations, which prove to contain considerable grammatical content. Our caption-generation model can be interpreted as generating sequences of grammatical categories and retrieving words by their categories from a plan encoded as a distributed representation.




1 Introduction

In this paper we introduce a new architecture for natural language processing (NLP). On what type of principles can a computational architecture be founded? It would seem a sound principle to require that the hypothesis space for learning which an architecture provides include network hypotheses that are independently known to be suitable for performing the target task. Our proposed architecture makes available to deep learning network configurations that perform natural language generation by use of Tensor Product Representations (TPRs) [29]. Whether learning will create TPRs is unknown in advance, but what we can say with certainty is that the hypothesis space being searched during learning includes TPRs as one appropriate solution to the problem.

TPRs are a general method for generating vector-space embeddings of complex symbol structures. Prior work has proved that TPRs enable powerful symbol processing to be carried out using neural network computation [28]. This includes generating parse trees that conform to a grammar [5], although incorporating such capabilities into deep learning networks such as those developed here remains for future work. The architecture presented here relies on simpler use of TPRs to generate sentences; grammars are not explicitly encoded here.

We test the proposed architecture by applying it to image-caption generation (on the MS-COCO dataset, [6]). The results improve upon a baseline deploying a state-of-the-art LSTM architecture [33], and the TPR foundations of the architecture provide greater interpretability.

Section 2 of the paper reviews TPR. Section 3 presents the proposed architecture, the Tensor Product Generation Network (TPGN). Section 4 describes the particular model we study for image captioning, and Section 5 presents the experimental results. Importantly, what the model has learned is interpreted in Section 5.3. Section 6 discusses the relation of the new model to previous work and Section 7 concludes.

2 Review of tensor product representation

The central idea of TPRs [27] can be appreciated by contrasting the TPR for a word string with a bag-of-words (BoW) vector-space embedding. In a BoW embedding, the vector that encodes Jay saw Kay is the same as the one that encodes Kay saw Jay: $\mathbf{J} + \mathbf{K} + \mathbf{s}$, where $\mathbf{J}, \mathbf{K}, \mathbf{s}$ are respectively the vector embeddings of the words Jay, Kay, saw.

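A TPR embedding that avoids this confusion starts by analyzing Jay saw Kay as the set {Jay/subj, Kay/obj, saw/verb}. (Other analyses are possible: see Section 3.) Next we choose an embedding in a vector space $V_F$ for Jay, Kay, saw as in the BoW case: $\mathbf{J}, \mathbf{K}, \mathbf{s}$. Then comes the step unique to TPRs: we choose an embedding in a vector space $V_R$ for the roles subj, obj, verb: $\mathbf{r}_{\text{subj}}$, $\mathbf{r}_{\text{obj}}$, $\mathbf{r}_{\text{verb}}$. Crucially, $\mathbf{r}_{\text{subj}} \neq \mathbf{r}_{\text{obj}}$. Finally, the TPR for Jay saw Kay is the following vector in $V_F \otimes V_R$:

$$\mathbf{v}_{\text{Jay saw Kay}} = \mathbf{J} \otimes \mathbf{r}_{\text{subj}} + \mathbf{K} \otimes \mathbf{r}_{\text{obj}} + \mathbf{s} \otimes \mathbf{r}_{\text{verb}} \tag{1}$$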


Each word is tagged with the role it fills in the sentence; Jay and Kay fill different roles.

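This TPR avoids the BoW confusion: $\mathbf{v}_{\text{Jay saw Kay}} \neq \mathbf{v}_{\text{Kay saw Jay}}$ because $\mathbf{r}_{\text{subj}} \neq \mathbf{r}_{\text{obj}}$. In the terminology of TPRs, in Jay saw Kay, Jay is the filler of the role subj, and $\mathbf{J} \otimes \mathbf{r}_{\text{subj}}$ is the vector embedding of the filler/role binding Jay/subj. In the vector space embedding, the binding operation is the tensor (or generalized outer) product $\otimes$; i.e., $\mathbf{J} \otimes \mathbf{r}_{\text{subj}}$ is a tensor with 2 indices defined by: $[\mathbf{J} \otimes \mathbf{r}_{\text{subj}}]_{\varphi\rho} \equiv [\mathbf{J}]_{\varphi} [\mathbf{r}_{\text{subj}}]_{\rho}$.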

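The tensor product can be used recursively, which is essential for the TPR embedding of recursive structures such as trees and for the computation of recursive functions over TPRs. However, in the present context, recursion will not be required, in which case the tensor product can be regarded as simply the matrix outer product (which cannot be used recursively); we can regard $\mathbf{J} \otimes \mathbf{r}_{\text{subj}}$ as the matrix product $\mathbf{J} \mathbf{r}_{\text{subj}}^{\top}$. Then Equation 1 becomes

$$\mathbf{v}_{\text{Jay saw Kay}} = \mathbf{J} \mathbf{r}_{\text{subj}}^{\top} + \mathbf{K} \mathbf{r}_{\text{obj}}^{\top} + \mathbf{s} \mathbf{r}_{\text{verb}}^{\top} \tag{2}$$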


Note that the set of matrices (or the set of tensors with any fixed number of indices) is a vector space; thus Jay saw Kay $\mapsto \mathbf{v}_{\text{Jay saw Kay}}$ is a vector-space embedding of the symbol structures constituting sentences. Whether we regard $\mathbf{v}_{\text{Jay saw Kay}}$ as a 2-index tensor or as a matrix, we can call it simply a ‘vector’ since it is an element of a vector space: in the context of TPRs, ‘vector’ is used in a general sense and should not be taken to imply a single-indexed array.
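The contrast between the two embeddings can be checked numerically. The following is a minimal sketch, assuming randomly drawn 4-dimensional filler vectors and one-hot role vectors; the names `bow` and `tpr` and all dimensions are illustrative choices, not part of the model described here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical filler (word) embeddings; the dimension 4 is arbitrary.
J, K, s = rng.normal(size=(3, 4))

# Hypothetical role embeddings; one-hot (hence orthonormal) only for convenience.
r_subj, r_obj, r_verb = np.eye(3)

def bow(words):
    """Bag-of-words embedding: the sum of the word vectors."""
    return sum(words)

def tpr(bindings):
    """TPR as a matrix: the sum of filler/role outer products f r^T."""
    return sum(np.outer(f, r) for f, r in bindings)

bow_jsk = bow([J, s, K])   # Jay saw Kay
bow_ksj = bow([K, s, J])   # Kay saw Jay

tpr_jsk = tpr([(J, r_subj), (K, r_obj), (s, r_verb)])
tpr_ksj = tpr([(K, r_subj), (J, r_obj), (s, r_verb)])

print("BoW embeddings equal:", np.allclose(bow_jsk, bow_ksj))
print("TPR embeddings equal:", np.allclose(tpr_jsk, tpr_ksj))
```

The BoW sums coincide because addition is commutative, whereas swapping the fillers of subj and obj changes which columns of the TPR matrix carry which word.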

Crucial to the computational power of TPRs and to the architecture we propose here is the notion of unbinding. Just as an outer product (the tensor product) can be used to bind the vector embedding a filler Jay to the vector embedding a role subj, $\mathbf{J} \otimes \mathbf{r}_{\text{subj}}$ or $\mathbf{J} \mathbf{r}_{\text{subj}}^{\top}$, so an inner product can be used to take the vector embedding a structure and unbind a role contained within that structure, yielding the symbol that fills the role.

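In the simplest case of orthonormal role vectors $\mathbf{r}_i$, to unbind role subj in Jay saw Kay we can compute the matrix-vector product: $\mathbf{v}_{\text{Jay saw Kay}} \mathbf{r}_{\text{subj}} = \mathbf{J}$ (because $\mathbf{r}_i^{\top} \mathbf{r}_j = \delta_{ij}$ when the role vectors are orthonormal). A similar situation obtains when the role vectors are not orthonormal, provided they are not linearly dependent: for each role such as subj there is an unbinding vector $\mathbf{u}_{\text{subj}}$ such that $\mathbf{r}_i^{\top} \mathbf{u}_j = \delta_{ij}$, so we get: $\mathbf{v}_{\text{Jay saw Kay}} \mathbf{u}_{\text{subj}} = \mathbf{J}$. A role vector such as $\mathbf{r}_{\text{subj}}$ and its unbinding vector $\mathbf{u}_{\text{subj}}$ are said to be duals of each other. (If $\mathbf{R}$ is the matrix in which each column is a role vector $\mathbf{r}_j$, then $\mathbf{R}$ is invertible when the role vectors are linearly independent; then the unbinding vectors $\mathbf{u}_i$ are the rows of $\mathbf{R}^{-1}$. When the $\mathbf{r}_j$ are orthonormal, $\mathbf{u}_i = \mathbf{r}_i$. Replacing the matrix inverse with the pseudo-inverse allows approximate unbinding if the role vectors are linearly dependent.)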
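To make the role of the dual vectors concrete, here is a small sketch with three linearly independent but non-orthonormal role vectors. The specific matrix `R`, the random fillers, and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fillers, one per row (standing in for J, K, s).
fillers = rng.normal(size=(3, 4))

# Column j of R is role vector r_j; the columns are independent but not orthonormal.
R = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# TPR of the sentence: sum_j f_j r_j^T, written as a single matrix product.
v = fillers.T @ R.T                      # shape (4, 3)

# Unbinding (dual) vectors are the rows of R^{-1}, so that r_i^T u_j = delta_ij.
U = np.linalg.inv(R)                     # row i is u_i
duality_holds = np.allclose(R.T @ U.T, np.eye(3))

# Unbinding role 0 with its dual vector recovers filler 0 exactly.
recovered = v @ U[0]

# Using the role vector itself fails here, because the roles are not orthonormal.
naive = v @ R[:, 0]
```

When the roles are linearly dependent, `np.linalg.pinv(R)` can be substituted for the inverse to obtain approximate unbinding vectors.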

We can now see how TPRs can be used to generate a sentence one word at a time. We start with the TPR for the sentence, e.g., $\mathbf{v}_{\text{Jay saw Kay}}$. From this vector we unbind the role of the first word, which is subj: the embedding of the first word is thus $\mathbf{v}_{\text{Jay saw Kay}} \mathbf{u}_{\text{subj}} = \mathbf{J}$, the embedding of Jay. Next we take the TPR for the sentence and unbind the role of the second word, which is verb: the embedding of the second word is then $\mathbf{v}_{\text{Jay saw Kay}} \mathbf{u}_{\text{verb}} = \mathbf{s}$, the embedding of saw. And so on.
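The word-by-word procedure just described can be sketched in a few lines. This toy version assumes a three-word vocabulary with one-hot word embeddings and orthonormal roles, so that decoding reduces to an argmax; `encode` and `generate` are hypothetical helper names.

```python
import numpy as np

# Hypothetical vocabulary with one-hot embeddings (column i of E embeds vocab[i]).
vocab = ["Jay", "Kay", "saw"]
E = np.eye(3)

# Orthonormal roles, so each unbinding vector equals its role vector.
roles = {"subj": 0, "verb": 1, "obj": 2}
R = np.eye(3)

def encode(word_role_pairs):
    """Build the TPR as a sum of outer products filler * role^T."""
    return sum(np.outer(E[:, vocab.index(w)], R[:, roles[r]])
               for w, r in word_role_pairs)

def generate(tpr, role_sequence):
    """Unbind one role per step, then decode the filler to the nearest word."""
    words = []
    for r in role_sequence:
        filler = tpr @ R[:, roles[r]]          # unbinding is a matrix-vector product
        words.append(vocab[int(np.argmax(E.T @ filler))])
    return words

v = encode([("Jay", "subj"), ("saw", "verb"), ("Kay", "obj")])
sentence = generate(v, ["subj", "verb", "obj"])
print(" ".join(sentence))
```

The sequence of roles passed to `generate` plays the part of the sequence of unbinding vectors; in the proposed architecture that sequence is produced by a learned subnetwork rather than supplied by hand.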

To accomplish this, we need two representations to generate the $t$-th word: (i) the TPR of the sentence, $\mathbf{S}$ (or of the string of not-yet-produced words, $\mathbf{S}_t$) and (ii) the unbinding vector for the $t$-th word, $\mathbf{u}_t$. The architecture we propose will therefore be a recurrent network containing two subnetworks: (i) a subnet $\mathcal{S}$ hosting the representation $\mathbf{S}_t$, and (ii) a subnet $\mathcal{U}$ hosting the unbinding vector $\mathbf{u}_t$. This is shown in Fig. 1.

Figure 1: Architecture of TPGN, a TPR-capable generation network. “$\boxtimes$” denotes the matrix-vector product.

3 A TPR-capable generation architecture

As Fig. 1 shows, the proposed Tensor Product Generation Network architecture (the dashed box labeled $\mathcal{N}$) is designed to support the technique for generation just described: the architecture is TPR-capable. There is a sentence-encoding subnetwork $\mathcal{S}$ which could host a TPR of the sentence to be generated, and an unbinding subnetwork $\mathcal{U}$ which could output a sequence of unbinding vectors $\mathbf{u}_t$; at time $t$, the embedding $\mathbf{f}_t$ of the word produced, $x_t$, could then be extracted from $\mathbf{S}_t$ via the matrix-vector product (shown in the figure by “$\boxtimes$”): $\mathbf{f}_t = \mathbf{S}_t \mathbf{u}_t$. The lexical-decoding subnetwork $\mathcal{L}$ converts the embedding vector $\mathbf{f}_t$ to the 1-hot vector $\mathbf{x}_t$ corresponding to the word $x_t$.

Unlike some other work [23], TPGN is not constrained to literally learn TPRs. The representations that will actually be housed in $\mathcal{S}$ and $\mathcal{U}$ are determined by end-to-end deep learning on a task: the bubbles in Fig. 1 show what would be the meanings of $\mathbf{S}_t$, $\mathbf{u}_t$ and $\mathbf{f}_t$ if an actual TPR scheme were instantiated in the architecture. The learned representations $\mathbf{S}_t$ will not be proven to literally be TPRs, but by analyzing the unbinding vectors $\mathbf{u}_t$ the network learns, we will gain insight into the process by which the learned matrices $\mathbf{S}_t$ give rise to the generated sentence.

The task studied here is image captioning; Fig. 1 shows that the input to this TPGN model is an image, preprocessed by a CNN which produces the initial representation in $\mathcal{S}$, $\mathbf{S}_0$. This vector drives the entire caption-generation process: it contains all the image-specific information for producing the caption. (We will call a caption a “sentence” even though it may in fact be just a noun phrase.)

The two subnets $\mathcal{S}$ and $\mathcal{U}$ are mutually-connected LSTMs [13]: see Fig. 2. The internal hidden state of $\mathcal{U}$, $\mathbf{p}_t$, is sent as input to $\mathcal{S}$; $\mathcal{U}$ also produces output, the unbinding vector $\mathbf{u}_t$. The internal hidden state of $\mathcal{S}$, $\mathbf{S}_t$, is sent as input to $\mathcal{U}$, and also produced as output. As stated above, these two outputs are multiplied together to produce the embedding vector $\mathbf{f}_t$ of the output word $x_t$. Furthermore, the 1-hot encoding $\mathbf{x}_t$ of $x_t$ is fed back at the next time step to serve as input to both $\mathcal{S}$ and $\mathcal{U}$.

What type of roles might the unbinding vectors be unbinding? A TPR for a caption could in principle be built upon positional roles, syntactic/semantic roles, or some combination of the two. In the caption a man standing in a room with a suitcase, the initial a and man might respectively occupy the positional roles of $\text{Pos(ition)}_1$ and $\text{Pos}_2$; standing might occupy the syntactic role of verb; in the role of Spatial-P(reposition); while a room with a suitcase might fill a 5-role schema $\text{Det(erminer)}_1\ \text{N(oun)}_1\ \text{P}\ \text{Det}_2\ \text{N}_2$. In fact we will provide evidence in Sec. 5.3.2 that our network learns just this kind of hybrid role decomposition; further evidence for these particular roles is presented elsewhere.

What form of information does the sentence-encoding subnetwork $\mathcal{S}$ need to encode in $\mathbf{S}$? Continuing with the example of the previous paragraph, $\mathbf{S}$ needs to be some approximation to the TPR summing several filler/role binding matrices. In one of these bindings, a filler vector $\mathbf{f}_{\text{a}}$ (which the lexical subnetwork $\mathcal{L}$ will map to the article a) is bound (via the outer product) to a role vector $\mathbf{r}_{\text{Pos}_1}$ which is the dual of the first unbinding vector produced by the unbinding subnetwork $\mathcal{U}$: $\mathbf{u}_{\text{Pos}_1}$. In the first iteration of generation the model computes $\mathbf{S}_1 \mathbf{u}_{\text{Pos}_1} = \mathbf{f}_{\text{a}}$, which $\mathcal{L}$ then maps to a. Analogously, another binding approximately contained in $\mathbf{S}_2$ is $\mathbf{f}_{\text{man}} \mathbf{r}_{\text{Pos}_2}^{\top}$. There are corresponding approximate bindings for the remaining words of the caption; these employ syntactic/semantic roles. One example is $\mathbf{f}_{\text{standing}} \mathbf{r}_{V}^{\top}$. At iteration 3, $\mathcal{U}$ decides the next word should be a verb, so it generates the unbinding vector $\mathbf{u}_{V}$ which, when multiplied by the current output of $\mathcal{S}$, the matrix $\mathbf{S}_3$, yields a filler vector $\mathbf{f}_{\text{standing}}$ which $\mathcal{L}$ maps to the output standing. $\mathcal{S}$ decided the caption should deploy standing as a verb and included in $\mathbf{S}$ an approximation to the binding $\mathbf{f}_{\text{standing}} \mathbf{r}_{V}^{\top}$. It similarly decided the caption should deploy in as a spatial preposition, approximately including in $\mathbf{S}$ the binding $\mathbf{f}_{\text{in}} \mathbf{r}_{\text{Spatial-P}}^{\top}$; and so on for the other words in their respective roles in the caption.

4 System Description

Figure 2: The sentence-encoding subnet $\mathcal{S}$ and the unbinding subnet $\mathcal{U}$ are inter-connected LSTMs; $\mathbf{v}$ encodes the visual input while the $\mathbf{x}_t$ encode the words of the output caption.

As stated above, the unbinding subnetwork $\mathcal{U}$ and the sentence-encoding subnetwork $\mathcal{S}$ of Fig. 1 are each implemented as (1-layer, 1-directional) LSTMs (see Fig. 2); the lexical subnetwork $\mathcal{L}$ is implemented as a linear transformation followed by a softmax operation.

In the equations below, the LSTM variables internal to the $\mathcal{S}$ subnet are indexed by 1 (e.g., the forget-, input-, and output-gates are respectively $\hat{\mathbf{f}}_1, \hat{\mathbf{i}}_1, \hat{\mathbf{o}}_1$) while those of the unbinding subnet $\mathcal{U}$ are indexed by 2.

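Thus the state updating equations for $\mathcal{S}$ are, for $t = 1, \dots, T$ = caption length:

$$\hat{\mathbf{f}}_{1,t} = \sigma_g(\mathbf{W}_{1,f}\,\mathbf{p}_{t-1} - \mathbf{D}_{1,f}\,\mathbf{W}_e\,\mathbf{x}_{t-1} + \mathbf{U}_{1,f}\,\hat{\mathbf{S}}_{t-1}) \tag{3}$$
$$\hat{\mathbf{i}}_{1,t} = \sigma_g(\mathbf{W}_{1,i}\,\mathbf{p}_{t-1} - \mathbf{D}_{1,i}\,\mathbf{W}_e\,\mathbf{x}_{t-1} + \mathbf{U}_{1,i}\,\hat{\mathbf{S}}_{t-1}) \tag{4}$$
$$\hat{\mathbf{o}}_{1,t} = \sigma_g(\mathbf{W}_{1,o}\,\mathbf{p}_{t-1} - \mathbf{D}_{1,o}\,\mathbf{W}_e\,\mathbf{x}_{t-1} + \mathbf{U}_{1,o}\,\hat{\mathbf{S}}_{t-1}) \tag{5}$$
$$\hat{\mathbf{g}}_{1,t} = \sigma_h(\mathbf{W}_{1,c}\,\mathbf{p}_{t-1} - \mathbf{D}_{1,c}\,\mathbf{W}_e\,\mathbf{x}_{t-1} + \mathbf{U}_{1,c}\,\hat{\mathbf{S}}_{t-1}) \tag{6}$$
$$\hat{\mathbf{c}}_{1,t} = \hat{\mathbf{f}}_{1,t} \odot \hat{\mathbf{c}}_{1,t-1} + \hat{\mathbf{i}}_{1,t} \odot \hat{\mathbf{g}}_{1,t} \tag{7}$$
$$\hat{\mathbf{S}}_{t} = \hat{\mathbf{o}}_{1,t} \odot \sigma_h(\hat{\mathbf{c}}_{1,t}) \tag{8}$$

Here $\hat{\mathbf{f}}_{1,t}$, $\hat{\mathbf{i}}_{1,t}$, $\hat{\mathbf{o}}_{1,t}$, $\hat{\mathbf{g}}_{1,t}$, $\hat{\mathbf{c}}_{1,t}$, $\hat{\mathbf{S}}_{t} \in \mathbb{R}^{d \times d}$, $\mathbf{p}_t \in \mathbb{R}^{d}$; $\sigma_g(\cdot)$ is the (element-wise) logistic sigmoid function; $\sigma_h(\cdot)$ is the hyperbolic tangent function; the operator $\odot$ denotes the Hadamard (element-wise) product; $\mathbf{W}_{1,f}, \mathbf{W}_{1,i}, \mathbf{W}_{1,o}, \mathbf{W}_{1,c} \in \mathbb{R}^{d \times d \times d}$, $\mathbf{U}_{1,f}, \mathbf{U}_{1,i}, \mathbf{U}_{1,o}, \mathbf{U}_{1,c} \in \mathbb{R}^{d \times d \times d \times d}$, and $\mathbf{D}_{1,f}, \mathbf{D}_{1,i}, \mathbf{D}_{1,o}, \mathbf{D}_{1,c}$ act on the word embedding $\mathbf{W}_e \mathbf{x}_{t-1}$. For clarity, biases (included throughout the model) are omitted from all equations in this paper. The initial state $\hat{\mathbf{S}}_0$ is initialized by:

$$\hat{\mathbf{S}}_0 = \mathbf{C}_s(\mathbf{v} - \bar{\mathbf{v}}) \tag{9}$$

$\mathbf{v} \in \mathbb{R}^{2048}$ is the vector of visual features extracted from the current image by ResNet [10] and $\bar{\mathbf{v}}$ is the mean of all such vectors; $\mathbf{C}_s \in \mathbb{R}^{d \times d \times 2048}$. On the output side, $\mathbf{x}_t \in \mathbb{R}^{V}$ is a 1-hot vector with dimension equal to the size of the caption vocabulary, $V$, and $\mathbf{W}_e$ is a word embedding matrix, the $i$-th column of which is the embedding vector of the $i$-th word in the vocabulary; it is obtained by the Stanford GloVe algorithm with zero mean [25]. $\mathbf{x}_0$ is initialized as the one-hot vector corresponding to a “start-of-sentence” symbol.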

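For $\mathcal{U}$ in Fig. 1, the state updating equations are:

$$\mathbf{f}_{2,t} = \sigma_g(\hat{\mathbf{S}}_{t-1}\,\mathbf{w}_{2,f} - \mathbf{D}_{2,f}\,\mathbf{W}_e\,\mathbf{x}_{t-1} + \mathbf{U}_{2,f}\,\mathbf{p}_{t-1}) \tag{10}$$
$$\mathbf{i}_{2,t} = \sigma_g(\hat{\mathbf{S}}_{t-1}\,\mathbf{w}_{2,i} - \mathbf{D}_{2,i}\,\mathbf{W}_e\,\mathbf{x}_{t-1} + \mathbf{U}_{2,i}\,\mathbf{p}_{t-1}) \tag{11}$$
$$\mathbf{o}_{2,t} = \sigma_g(\hat{\mathbf{S}}_{t-1}\,\mathbf{w}_{2,o} - \mathbf{D}_{2,o}\,\mathbf{W}_e\,\mathbf{x}_{t-1} + \mathbf{U}_{2,o}\,\mathbf{p}_{t-1}) \tag{12}$$
$$\mathbf{g}_{2,t} = \sigma_h(\hat{\mathbf{S}}_{t-1}\,\mathbf{w}_{2,c} - \mathbf{D}_{2,c}\,\mathbf{W}_e\,\mathbf{x}_{t-1} + \mathbf{U}_{2,c}\,\mathbf{p}_{t-1}) \tag{13}$$
$$\mathbf{c}_{2,t} = \mathbf{f}_{2,t} \odot \mathbf{c}_{2,t-1} + \mathbf{i}_{2,t} \odot \mathbf{g}_{2,t} \tag{14}$$
$$\mathbf{p}_{t} = \mathbf{o}_{2,t} \odot \sigma_h(\mathbf{c}_{2,t}) \tag{15}$$

Here $\mathbf{w}_{2,f}, \mathbf{w}_{2,i}, \mathbf{w}_{2,o}, \mathbf{w}_{2,c} \in \mathbb{R}^{d}$, $\mathbf{U}_{2,f}, \mathbf{U}_{2,i}, \mathbf{U}_{2,o}, \mathbf{U}_{2,c} \in \mathbb{R}^{d \times d}$, and $\mathbf{D}_{2,f}, \mathbf{D}_{2,i}, \mathbf{D}_{2,o}, \mathbf{D}_{2,c}$ act on the word embedding $\mathbf{W}_e \mathbf{x}_{t-1}$. The initial state $\mathbf{p}_0$ is the zero vector.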

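The dimensionality of the crucial vectors shown in Fig. 1, $\mathbf{u}_t$ and $\mathbf{f}_t$, is increased from $d \times 1$ to $d^2 \times 1$ as follows. A block-diagonal $d^2 \times d^2$ matrix $\mathbf{S}_t$ is created by placing $d$ copies of the $d \times d$ matrix $\hat{\mathbf{S}}_t$ as blocks along the principal diagonal. This matrix is the output of the sentence-encoding subnetwork $\mathcal{S}$. Now the ‘filler vector’ $\mathbf{f}_t \in \mathbb{R}^{d^2}$, ‘unbound’ from the sentence representation $\mathbf{S}_t$ with the ‘unbinding vector’ $\mathbf{u}_t$, is obtained by Eq. (16).

$$\mathbf{f}_t = \mathbf{S}_t \mathbf{u}_t \tag{16}$$

Here $\mathbf{u}_t \in \mathbb{R}^{d^2}$, the output of the unbinding subnetwork $\mathcal{U}$, is computed as in Eq. (17), where $\mathbf{W}_u \in \mathbb{R}^{d^2 \times d}$ is $\mathcal{U}$’s output weight matrix.

$$\mathbf{u}_t = \sigma_h(\mathbf{W}_u \mathbf{p}_t) \tag{17}$$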
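The block-diagonal construction and the unbinding product of Eq. (16) can be sketched as follows. This is an illustration with a small toy dimension and random stand-ins for the network outputs; the variable names are hypothetical. It also shows an equivalent computation that avoids building the large matrix.

```python
import numpy as np

d = 3                                   # toy size for illustration
rng = np.random.default_rng(2)
S_hat = rng.normal(size=(d, d))         # stands in for the d x d encoder state
u = rng.normal(size=d * d)              # stands in for the unbinding vector

# d copies of S_hat along the principal diagonal give a d^2 x d^2 matrix.
S = np.kron(np.eye(d), S_hat)

# Unbinding: the filler vector is a matrix-vector product.
f = S @ u

# Equivalent computation without materialising S: split u into d chunks of
# length d and apply S_hat to each chunk.
f_chunked = (u.reshape(d, d) @ S_hat.T).reshape(-1)

print(S.shape, np.allclose(f, f_chunked))
```

Because the same $d \times d$ block is reused $d$ times, the enlarged matrix adds no free parameters; it only lets each length-$d$ chunk of the unbinding vector select its own combination of the block's columns.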


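Finally, the lexical subnetwork $\mathcal{L}$ produces a decoded word $\mathbf{x}_t \in \mathbb{R}^{V}$ by

$$\mathbf{x}_t = \sigma_s(\mathbf{W}_x \mathbf{f}_t) \tag{18}$$

where $\sigma_s(\cdot)$ is the softmax function and $\mathbf{W}_x \in \mathbb{R}^{V \times d^2}$ is the overall output weight matrix. Since $\mathbf{W}_x$ plays the role of a word de-embedding matrix, we can set

$$\mathbf{W}_x = (\mathbf{W}_e)^{\top} \tag{19}$$

where $\mathbf{W}_e$ is the word-embedding matrix. Since $\mathbf{W}_e$ is pre-defined, we directly set $\mathbf{W}_x$ by Eq. (19) without training $\mathcal{L}$ through Eq. (18). Note that $\mathcal{S}$ and $\mathcal{U}$ are learned jointly through the end-to-end training as shown in Algorithm 1.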
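Tying the de-embedding matrix to the transpose of the embedding matrix can be illustrated with a toy example. The sketch below assumes a five-word vocabulary and unit-normalised random embeddings (the normalisation is an assumption made here so that the toy decoding is unambiguous; it is not stated in the text).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(3)
V, k = 5, 4                                   # toy vocabulary size and embedding size
W_e = rng.normal(size=(k, V))                 # column i embeds word i
W_e /= np.linalg.norm(W_e, axis=0, keepdims=True)   # unit-norm columns (assumption)

W_x = W_e.T                                   # tied de-embedding matrix

# A filler vector pointing along the embedding of word 2.
f = 3.0 * W_e[:, 2]

probs = softmax(W_x @ f)                      # distribution over the vocabulary
word = int(np.argmax(probs))
print(word, probs.round(3))
```

With unit-norm embeddings, each logit is proportional to the cosine similarity between the filler vector and a word embedding, so the word whose embedding the filler points along receives the largest probability.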

Input: Image feature vectors $\mathbf{v}^{(i)}$ and corresponding captions $\mathbf{X}^{(i)} = [\mathbf{x}^{(i)}_1, \dots, \mathbf{x}^{(i)}_T]$ ($i = 1, \dots, N$), where $N$ is the total number of samples.
Output: the trained weights of $\mathcal{S}$ and $\mathcal{U}$.

Initialize $\hat{\mathbf{S}}_0$ by (9);
Initialize $\mathbf{x}_0$ as the one-hot vector corresponding to the start-of-sentence symbol;
Initialize $\mathbf{p}_0$ as the zero vector;
Randomly initialize the weights;
for $n$ from $1$ to $N$ do
    for $t$ from $1$ to $T$ do
        Calculate (3) – (8) to obtain $\hat{\mathbf{S}}_t$;
        Calculate (10) – (15) to obtain $\mathbf{p}_t$;
        Calculate (17) to obtain $\mathbf{u}_t$;
        Calculate (16) to obtain $\mathbf{f}_t$;
        Calculate (18) to obtain $\mathbf{x}_t$;
        Update the weights by the back-propagation algorithm;
    end for
end for
Algorithm 1 End-to-end training of $\mathcal{S}$ and $\mathcal{U}$
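The inner loop of Algorithm 1 (without the weight update) can be sketched as a forward pass through two coupled LSTM cells. This is an untrained illustration under several loudly stated assumptions: the tensor shapes (including a $d^2$-dimensional word embedding), the subtraction of the word-embedding term, the tanh output activation, zero initial cell states, and greedy argmax decoding are all choices made here so that the products conform. `init_params`, `step`, and `generate` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def init_params(d, V, rng, scale=0.1):
    """Random, untrained weights; every shape here is an assumption."""
    P = {"We": rng.normal(0, scale, (d * d, V)),   # assumed d^2-dim word embeddings
         "Wu": rng.normal(0, scale, (d * d, d))}   # output weights of the unbinding subnet
    for g in "fioc":                               # forget, input, output, candidate
        P["W1" + g] = rng.normal(0, scale, (d, d, d))
        P["D1" + g] = rng.normal(0, scale, (d, d, d * d))
        P["U1" + g] = rng.normal(0, scale, (d, d, d, d))
        P["w2" + g] = rng.normal(0, scale, d)
        P["D2" + g] = rng.normal(0, scale, (d, d * d))
        P["U2" + g] = rng.normal(0, scale, (d, d))
    return P

def lstm_update(pre, c_prev):
    """Standard LSTM cell update from the four gate pre-activations."""
    f, i, o = sigmoid(pre["f"]), sigmoid(pre["i"]), sigmoid(pre["o"])
    c = f * c_prev + i * np.tanh(pre["c"])
    return o * np.tanh(c), c

def step(P, d, S_hat, c1, p, c2, x_prev):
    """One time step; each subnet reads the other's previous hidden state."""
    e = P["We"] @ x_prev                           # embedding of the previous word
    pre1 = {g: (np.einsum("ijk,k->ij", P["W1" + g], p)
                - np.einsum("ijk,k->ij", P["D1" + g], e)
                + np.einsum("ijkl,kl->ij", P["U1" + g], S_hat)) for g in "fioc"}
    pre2 = {g: S_hat @ P["w2" + g] - P["D2" + g] @ e + P["U2" + g] @ p
            for g in "fioc"}
    S_new, c1 = lstm_update(pre1, c1)              # sentence-encoding subnet
    p_new, c2 = lstm_update(pre2, c2)              # unbinding subnet
    u = np.tanh(P["Wu"] @ p_new)                   # unbinding vector
    f = np.kron(np.eye(d), S_new) @ u              # filler via block-diagonal product
    probs = softmax(P["We"].T @ f)                 # tied de-embedding, then softmax
    return S_new, c1, p_new, c2, probs

def generate(P, d, V, S0, start_id, max_len):
    """Greedy decoding for max_len steps (greedy choice is an assumption)."""
    S_hat, c1 = S0, np.zeros((d, d))
    p, c2 = np.zeros(d), np.zeros(d)
    x, out = np.eye(V)[start_id], []
    for _ in range(max_len):
        S_hat, c1, p, c2, probs = step(P, d, S_hat, c1, p, c2, x)
        w = int(np.argmax(probs))
        out.append(w)
        x = np.eye(V)[w]                           # feed the 1-hot word back in
    return out, probs

rng = np.random.default_rng(0)
d, V = 3, 7
P = init_params(d, V, rng)
S0 = rng.normal(size=(d, d))                       # stands in for the image-derived initial state
caption_ids, last_probs = generate(P, d, V, S0, start_id=0, max_len=4)
print(caption_ids)
```

With random weights the output word indices are meaningless; the sketch only shows how the two hidden states, the unbinding vector, and the fed-back word interact at each step.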

5 Experimental results

5.1 Dataset

To evaluate the performance of our proposed model, we use the COCO dataset [6]. The COCO dataset contains 123,287 images, each of which is annotated with at least 5 captions. We use the same pre-defined splits as in [14, 10]: 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. We use the same vocabulary as that employed in [10], which consists of 8,791 words.

5.2 Evaluation

For the CNN of Fig. 1, we used ResNet-152 [12], pretrained on the ImageNet dataset. The feature vector $\mathbf{v}$ has 2048 dimensions. Word embedding vectors in $\mathbf{W}_e$ are downloaded from the web [25]. The model is implemented in TensorFlow with the default settings for random initialization and optimization by backpropagation.

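In our experiments, we choose $d = 25$ (where $d$ is the dimension of vector $\mathbf{p}_t$). The dimension of $\mathbf{S}_t$ is $625 \times 625$ (while $\hat{\mathbf{S}}_t$ is $25 \times 25$); the vocabulary size $V = 8{,}791$; the dimension of $\mathbf{u}_t$ and $\mathbf{f}_t$ is $d^2 = 625$.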

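| Methods | METEOR | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr |
|---|---|---|---|---|---|---|
| NIC [33] | 0.237 | 0.666 | 0.461 | 0.329 | 0.246 | 0.855 |
| CNN-LSTM | 0.238 | 0.698 | 0.525 | 0.390 | 0.292 | 0.889 |
| TPGN | 0.243 | 0.709 | 0.539 | 0.406 | 0.305 | 0.909 |

Table 1: Performance of the proposed TPGN model on the COCO dataset.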

The main evaluation results on the MS COCO dataset are reported in Table 1, using the widely used BLEU [24], METEOR [3], and CIDEr [32] metrics. Our baseline is the widely used CNN-LSTM captioning method originally proposed in [33]. For comparison, we include the results from that paper in the first line of Table 1. We also re-implemented the model using the latest ResNet features and report the results in the second line of Table 1. Our re-implementation of the CNN-LSTM method matches the performance reported in [10], showing that the baseline is a state-of-the-art implementation. As shown in Table 1, the proposed TPGN outperforms both benchmark schemes on every metric. The relative improvement in BLEU-$n$ over the CNN-LSTM baseline is greater for greater $n$; TPGN particularly improves generation of longer subsequences. The results attest to the effectiveness of the TPGN architecture.
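The trend across BLEU orders can be checked directly from the scores in Table 1. The short sketch below assumes that the four middle columns of the table are BLEU-1 through BLEU-4 and computes the relative gain of TPGN over the CNN-LSTM baseline for each.

```python
# Scores copied from Table 1 (column labels BLEU-1..BLEU-4 are an assumption).
cnn_lstm = {"BLEU-1": 0.698, "BLEU-2": 0.525, "BLEU-3": 0.390, "BLEU-4": 0.292}
tpgn = {"BLEU-1": 0.709, "BLEU-2": 0.539, "BLEU-3": 0.406, "BLEU-4": 0.305}

# Relative improvement of TPGN over the baseline, per metric.
relative_gain = {m: (tpgn[m] - cnn_lstm[m]) / cnn_lstm[m] for m in cnn_lstm}

for metric, gain in relative_gain.items():
    print(f"{metric}: {100 * gain:.1f}%")
```

The gains are computed relative to the baseline score; the absolute differences are all within a few thousandths of one another, so it is the relative measure that grows with $n$.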

5.3 Interpretation of learned unbinding vectors

To get a sense of how the sentence encodings $\mathbf{S}_t$ learned by TPGN approximate TPRs, we now investigate the meaning of the role-unbinding vector $\mathbf{u}_t$ the model uses to unbind from $\mathbf{S}_t$ (via Eq. (16)) the filler vector $\mathbf{f}_t$ that produces (via Eq. (18)) the one-hot vector $\mathbf{x}_t$ of the $t$-th generated caption word. The meaning of an unbinding vector is the meaning of the role it unbinds. Interpreting the unbinding vectors reveals the meaning of the roles in a TPR that $\mathbf{S}$ approximates.

Figure 3: Unbinding vectors of 1000 words; different POS tags of words are represented by different colors.

5.3.1 Visualization of $\mathbf{u}_t$

We run the TPGN model with 5,000 test images as input, and obtain the unbinding vector $\mathbf{u}_t$ used to generate each word $x_t$ in the caption of a test image. We plot 1,000 unbinding vectors $\mathbf{u}_t$, which correspond to the first 1,000 words in the resulting captions of these 5,000 test images. There are 17 parts of speech (POS) in these 1,000 words. The POS tags are obtained by the Stanford Parser [20].

We use the Embedding Projector in TensorBoard [11] to plot the 1,000 unbinding vectors $\mathbf{u}_t$, applying a custom linear projection that reduces the 625 dimensions of $\mathbf{u}_t$ to the 2 dimensions shown in Fig. 3 through Fig. 7.

Fig. 3 shows the unbinding vectors of 1000 words; different POS tags of words are represented by different colors. In fact, we can partition the 625-dim space of $\mathbf{u}_t$ into 17 regions such that, on average, 76.3% of the words in a region share the same POS; i.e., each region is dominated by words of one POS type. This clearly indicates that each unbinding vector contains important grammatical information about the word it generates. As examples, Fig. 4 to Fig. 7 show the distribution of the unbinding vectors of nouns, verbs, adjectives, and prepositions, respectively.

Figure 4: Unbinding vectors of 360 nouns in red and 640 words of other types of POS in grey.
Figure 5: Unbinding vectors of 81 verbs in red and 919 words of other types of POS in grey.
Figure 6: Unbinding vectors of 55 adjectives in red and 945 words of other types of POS in grey.
Figure 7: Unbinding vectors of 169 prepositions in red and 831 words of other types of POS in grey.
| Category | $N_w$ | $N_r$ | $P_r$ |
|---|---|---|---|
| Nouns | 16683 | 16115 | 0.969 |
| Pronouns | 462 | 442 | 0.957 |
| Indefinite articles | 7248 | 7107 | 0.981 |
| Definite articles | 797 | 762 | 0.956 |
| Adjectives | 2543 | 2237 | 0.880 |
| Verbs | 3558 | 3409 | 0.958 |
| Prepositions & conjunctions | 8184 | 7859 | 0.960 |
| Adverbs | 13 | 8 | 0.615 |

Table 2: Conformity to N/V generalization ($N_u = 2$).

| ID | Interpretation (proportion) |
|---|---|
| 2 | Position 1 (1.00) |
| 3 | Position 2 (1.00) |
| 1 | Noun (0.54), Determiner (0.43) |
| 5 | Determiner (0.50), Noun (0.19), Preposition (0.15) |
| 7 | Noun (0.88), Adjective (0.09) |
| 9 | Determiner (0.90), Noun (0.10) |
| 0 | Preposition (0.64), . (0.16), V (0.14) |
| 4 | Preposition: spatial (0.72) non-spatial (0.19) |
| 6 | Preposition (0.59), . (0.14) |
| 8 | Verb (0.37), Preposition (0.36), . (0.20) |

Table 3: Interpretation of unbinding clusters ($N_u = 10$).

5.3.2 Clustering of $\mathbf{u}_t$

Since the previous section indicates that there is a clustering structure for $\mathbf{u}_t$, in this section we partition the $\mathbf{u}_t$ into $N_u$ clusters and examine the grammar roles played by $\mathbf{u}_t$.

First, we run the trained TPGN model on the 113,287 training images, obtaining the role-unbinding vector $\mathbf{u}_t$ used to generate each word $x_t$ in the caption sentence. There are approximately 1.2 million $\mathbf{u}_t$ vectors over all the training images. We apply the K-means clustering algorithm to these vectors to obtain $N_u$ clusters and the centroid $\boldsymbol{\mu}_i$ of each cluster ($i = 0, \dots, N_u - 1$).

Then, we run the TPGN model with 5,000 test images as input, and obtain the role vector $\mathbf{u}_t$ of each word $x_t$ in the caption sentence of a test image. Using the nearest neighbor rule, we obtain the index $i$ of the cluster that each $\mathbf{u}_t$ is assigned to.
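The two-stage procedure (fit centroids on training vectors, then assign test vectors by the nearest-neighbor rule) can be sketched with a plain NumPy implementation of Lloyd's algorithm. This is an illustration on synthetic 4-dimensional data, not the setup used in the experiments; `kmeans` and `assign` are hypothetical helper names, and the random initialisation is an assumption.

```python
import numpy as np

def assign(X, centroids):
    """Nearest-neighbor rule: index of the closest centroid for each row of X."""
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels of X)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Recompute each centroid; keep the old one if a cluster becomes empty.
        updated = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(X, centroids)

# Synthetic stand-ins for unbinding vectors: two well-separated groups.
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(-5.0, 0.5, size=(50, 4)),
                   rng.normal(5.0, 0.5, size=(50, 4))])
test = np.array([[-5.0] * 4, [5.0] * 4])

centroids, train_labels = kmeans(train, k=2)
test_labels = assign(test, centroids)     # cluster index for each test vector
print(test_labels)
```

The cluster indices themselves are arbitrary (which group is called 0 or 1 depends on initialisation), so any interpretation has to come from inspecting the words generated by the vectors in each cluster.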

The partitioning of the unbinding vectors into $N_u = 2$ clusters exposes the most fundamental distinction made by the roles. We find that the vectors assigned to Cluster 1 generate words which are nouns, pronouns, indefinite and definite articles, and adjectives, while the vectors assigned to Cluster 0 generate verbs, prepositions, conjunctions, and adverbs. Thus Cluster 1 contains the noun-related words, Cluster 0 the verb-like words (verbs, prepositions and conjunctions are all potentially followed by noun-phrase complements, for example). Cross-cutting this distinction is another dimension, however: the initial word in a caption (always a determiner) is sometimes generated with a Cluster 1 unbinding vector, sometimes with a Cluster 0 vector. Outside the caption-initial position, exceptions to the nominal/verbal Cluster 1/0 generalization are rare, as attested by the high rates of conformity to the generalization shown in Table 2.

Table 2 shows the likelihood of correctness of this ‘N/V’ generalization for the words in 5,000 sentences captioned for the 5,000 test images; $N_w$ is the number of words in the category, $N_r$ is the number of words conforming to the generalization, and $P_r$ is the proportion conforming. We use the Natural Language Toolkit [22] to identify the part of speech of each word in the captions.
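The conformity statistics can be computed from a list of (category, cluster) observations as sketched below. The observation list is made-up toy data, and the helper names and category labels are hypothetical; only the nominal/verbal split itself follows the description above.

```python
from collections import Counter

# Categories expected in Cluster 1 (nominal) and Cluster 0 (verbal).
NOMINAL = {"noun", "pronoun", "indefinite article", "definite article", "adjective"}
VERBAL = {"verb", "preposition", "conjunction", "adverb"}

def expected_cluster(category):
    """Cluster predicted by the N/V generalization for a POS category."""
    if category in NOMINAL:
        return 1
    if category in VERBAL:
        return 0
    raise ValueError(f"unknown category: {category}")

def conformity(observations):
    """Map each category to (words in category, conforming words, proportion)."""
    total, conforming = Counter(), Counter()
    for category, cluster in observations:
        total[category] += 1
        conforming[category] += int(cluster == expected_cluster(category))
    return {c: (total[c], conforming[c], conforming[c] / total[c]) for c in total}

# Toy observations: (POS category of a generated word, cluster of its unbinding vector).
observations = [("noun", 1), ("noun", 1), ("noun", 0),
                ("verb", 0), ("adverb", 1), ("adverb", 0)]
table = conformity(observations)
for category, (n_words, n_conforming, proportion) in table.items():
    print(f"{category}: {n_words} {n_conforming} {proportion:.3f}")
```

Each output row has the same three quantities as a row of Table 2: the category count, the number of conforming words, and their ratio.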

A similar analysis with $N_u = 10$ clusters reveals the results shown in Table 3; these results concern the first 100 captions, which were inspected manually to identify interpretable patterns. (More comprehensive results will be discussed elsewhere.)

The clusters can be interpreted as falling into 3 groups (see Table 3). Clusters 2 and 3 are clearly positional roles: every initial word is generated by a role-unbinding vector from Cluster 2, and such vectors are not used elsewhere in the string. The same holds for Cluster 3 and the second caption word.

For caption words after the second word, position is replaced by syntactic/semantic properties for interpretation purposes. The vector clusters aside from 2 and 3 generate words with a dominant grammatical category: for example, unbinding vectors assigned to Cluster 4 generate words that are 91% likely to be prepositions, and 72% likely to be spatial prepositions. Cluster 7 generates 88% nouns and 9% adjectives, with the remaining 3% scattered across other categories. As Table 3 shows, Clusters 1, 5, 7, and 9 are primarily nominal, and Clusters 0, 4, 6, and 8 primarily verbal. (Only Cluster 5 spans the N/V divide.)

6 Related work

This work follows a great deal of recent caption-generation literature in exploiting end-to-end deep learning with a CNN image-analysis front end producing a distributed representation that is then used to drive a natural-language generation process, typically using RNNs [21, 33, 7, 4, 8, 14, 15, 16]. Our grammatical interpretation of the structural roles of words in sentences makes contact with other work that incorporates deep learning into grammatically-structured networks [31, 18, 17, 2, 34, 19, 30, 26]. Here, the network is not itself structured to match the grammatical structure of sentences being processed; the structure is fixed, but is designed to support the learning of distributed representations that incorporate structure internal to the representations themselves — filler/role structure.

TPRs are also used in NLP in [23] but there the representation of each individual input word is constrained to be a literal TPR filler/role binding. (The idea of using the outer product to construct internal representations was also explored in [9].) Here, by contrast, the learned representations are not themselves constrained, but the global structure of the network is designed to display the somewhat abstract property of being TPR-capable: the architecture uses the TPR unbinding operation of the matrix-vector product to extract individual words for sequential output.

7 Conclusion

Tensor Product Representation (TPR) [27] is a general technique for constructing vector embeddings of complex symbol structures in such a way that powerful symbolic functions can be computed using hand-designed neural network computation. Integrating TPR with deep learning is a largely open problem for which the work presented here proposes a general approach: design deep architectures that are TPR-capable — TPR computation is within the scope of the capabilities of the architecture in principle. For natural language generation, we proposed such an architecture, the Tensor Product Generation Network (TPGN): it embodies the TPR operation of unbinding which is used to extract particular symbols (e.g., words) from complex structures (e.g., sentences). The architecture can be interpreted as containing a part that encodes a sentence and a part that selects one structural role at a time to extract from the sentence. We applied the approach to image-caption generation, developing a TPGN model that was evaluated on the COCO dataset, on which it outperformed LSTM baselines on a range of standard metrics. Unlike standard LSTMs, however, the TPGN model admits a level of interpretability: we can see which roles are being unbound by the unbinding vectors generated internally within the model. We find such roles contain considerable grammatical information, enabling POS tag prediction for the words they generate and displaying clustering by POS.