Deep learning (DL) is an important tool in many natural language processing (NLP) applications. Since natural languages are rich in grammatical structure, there is increasing interest in using deep learning models to learn vector representations that capture the grammatical structure of natural language descriptions [1, 2, 3].
In this work, we propose a new architecture, called Attentive Tensor Product Learning (ATPL), to address this representation problem by exploiting Tensor Product Representations (TPR) [4, 5]. TPR is a structured neural-symbolic model developed in cognitive science over 20 years ago. In the TPR theory, a sentence is considered a sequence of roles (i.e., grammatical components), each filled with a filler (i.e., a token). Given that each role is associated with a role vector $r_t$ and each filler with a filler vector $f_t$, the TPR of a sentence can be computed as $\mathbf{S} = \sum_t f_t r_t^\top$. Compared with the popular RNN-based representations of a sentence, a useful property of TPR is that the token at timestep $t$ can be decoded directly by providing an unbinding vector $u_t$; that is, $f_t = \mathbf{S} u_t$. Under the TPR theory, encoding and decoding a sentence is equivalent to learning the role vectors $r_t$ or the unbinding vectors $u_t$ at each position $t$.
We employ the TPR theory to develop a novel attention-based neural network architecture for learning the unbinding vectors, which serves as the core of ATPL. That is, ATPL employs a form of recurrent neural network to produce the unbinding vectors $u_t$ one at a time. At each timestep $t$, the TPR of the partial prefix of the sentence up to time $t-1$ is leveraged to compute the attention maps, which are then used to compute the TPR as well as the unbinding vector at time $t$. In doing so, our ATPL can be used not only to generate a sequence of tokens, but also to generate a sequence of roles, which can interpret the syntactic/semantic structure of the sentence.
To demonstrate the effectiveness of our ATPL architecture, we apply it to three important NLP tasks: 1) image captioning; 2) POS tagging; and 3) constituency parsing of a sentence. The first showcases our ATPL-based generator, while the latter two demonstrate the power of role vectors in interpreting sentences' syntactic structures. Our evaluation shows that on both image captioning and POS tagging, our approach can outperform previous state-of-the-art approaches. In particular, on the constituency parsing task, when the structural segmentation is given as ground truth, our ATPL approach beats the state-of-the-art by several points on the Penn TreeBank dataset. These results demonstrate that ATPL is effective at capturing the syntactic structure of natural language sentences.
2 Related work
Our proposed image captioning system follows a great deal of recent caption-generation literature in exploiting end-to-end deep learning, with a CNN image-analysis front end producing a distributed representation that is then used to drive a natural-language generation process, typically using RNNs [6, 7, 8]. Our grammatical interpretation of the structural roles of words in sentences makes contact with other work that incorporates deep learning into grammatically structured networks [1, 9, 10, 11]. Here, however, the network itself is not structured to match the grammatical structure of the sentences being processed; the architecture is fixed, but is designed to support the learning of distributed representations that incorporate structure internal to the representations themselves: filler/role structure.
The second task we consider is POS tagging. Methods for automatic POS tagging include unigram tagging, bigram tagging, tagging using hidden Markov models (which are generative sequence models), maximum entropy Markov models (which are discriminative sequence models), rule-based tagging, and tagging using bidirectional maximum entropy Markov models [12]. The celebrated Stanford POS tagger [13] uses a bidirectional version of the maximum entropy Markov model called a cyclic dependency network [14].
Methods for automatic constituency parsing of a sentence, our third task, include methods based on probabilistic context-free grammars (CFGs) [12], the shift-reduce method [15], and sequence-to-sequence LSTMs [16]. Our constituency parser is similar to the sequence-to-sequence LSTMs of [16], since both use LSTM neural networks to design a constituency parser. Different from [16], our constituency parser uses TPR and unbinding role vectors to extract features that contain grammatical information.
3 Attentive Tensor Product Learning
In this section, we present the ATPL architecture. We will first briefly revisit the Tensor Product Representation (TPR) theory, and then introduce several building blocks. In the end, we explain the ATPL architecture, which is illustrated in Figure 1.
3.1 Background: Tensor Product Representation
The TPR theory allows computing a vector representation of a sentence as the summation of representations of its individual tokens while taking the order of the tokens into consideration. For a sentence of $n$ words, denoted by $w_1 w_2 \cdots w_n$, TPR theory considers the sentence as a sequence of grammatical role slots, with each slot filled by a concrete token $w_t$. The role slot is thus referred to as a role, while the token is referred to as a filler.
The TPR of the sentence can thus be computed by binding each role with a filler. Mathematically, each role is associated with a role vector $r_t \in \mathbb{R}^{q}$, and each filler with a filler vector $f_t \in \mathbb{R}^{d}$. Then the TPR of the sentence is
$$\mathbf{S} \;=\; \sum_{t=1}^{n} f_t\, r_t^\top, \qquad (1)$$
where $\mathbf{S} \in \mathbb{R}^{d \times q}$. Each role $r_t$ is also associated with a dual unbinding vector $u_t$ so that $r_t^\top u_t = 1$ and $r_{t'}^\top u_t = 0$ for $t' \neq t$; then
$$f_t \;=\; \mathbf{S}\, u_t. \qquad (2)$$
Intuitively, Eq. (2) requires that $R^\top U = I$, where $R = [r_1, \dots, r_n]$, $U = [u_1, \dots, u_n]$, and $I$ is an identity matrix. In a simplified case, i.e., when the role vectors are orthogonal to each other and of unit norm, we can easily derive $u_t = r_t$.
Eq. (1) and (2) provide the means to bind or unbind a TPR. Through these mechanisms, it is easy to construct an encoder and a decoder to convert between a sentence and its TPR. All we need to compute is the role vector $r_t$ (or its dual unbinding vector $u_t$) at each timestep $t$. One simple approach is to compute it as the hidden state of a recurrent neural network (e.g., an LSTM). However, this simple strategy may not yield the best performance.
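The binding and unbinding mechanics of Eqs. (1) and (2) can be sketched numerically; the dimensions below are arbitrary illustration values, and orthonormal roles are assumed so that $u_t = r_t$:

```python
import numpy as np

# Minimal sketch of TPR binding/unbinding (Eqs. 1-2), assuming
# orthonormal role vectors so that the unbinding vector u_t equals r_t.
rng = np.random.default_rng(0)

n, d_f, d_r = 4, 8, 4                     # 4 tokens; filler dim 8, role dim 4
fillers = rng.normal(size=(n, d_f))       # row t is the filler vector f_t

# Orthonormal roles via QR decomposition: r_t^T r_s = delta_{t,s}.
q_mat, _ = np.linalg.qr(rng.normal(size=(d_r, n)))
roles = q_mat.T                           # row t is the role vector r_t

# Binding (Eq. 1): S = sum_t f_t r_t^T
S = sum(np.outer(fillers[t], roles[t]) for t in range(n))

# Unbinding (Eq. 2): f_t = S u_t, here with u_t = r_t
recovered = np.stack([S @ roles[t] for t in range(n)])
assert np.allclose(recovered, fillers)
```

Because the roles are orthonormal, every filler is recovered exactly regardless of the order in which the bindings were summed into $\mathbf{S}$.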
3.2 Building blocks
Before we start introducing ATPL, we first introduce several building blocks repeatedly used in our construction.
An attention module $\mathrm{ATT}(x)$ over an input vector $x \in \mathbb{R}^{d}$ is defined as
$$\mathrm{ATT}(x) \;=\; \sigma\big(W_2 \tanh(W_1 x)\big),$$
where $\sigma$ is the sigmoid function, $W_1$ and $W_2$ are parameter matrices, $d$ is the dimension of $x$, and $m$ is the dimension of the output. Intuitively, $\mathrm{ATT}(x)$ outputs a vector in $[0,1]^m$ as the attention heatmap, and $m$ is equal to the dimension that the heatmap will be attended to. $W_1$ and $W_2$ are two sets of parameters. Unless otherwise noted, the parameter sets of different attention modules are disjoint from each other.
We refer to a Feed-Forward Neural Network (FFNN) module as a single fully-connected layer:
$$\mathrm{FF}(x) \;=\; \tanh(W x + b),$$
where $W$ and $b$ are the parameter matrix and the parameter vector with appropriate dimensions, respectively, and $\tanh$ is the hyperbolic tangent function.
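The two building blocks can be sketched as follows; the parameter shapes are illustrative assumptions, since the exact wiring inside the paper's modules is not fully specified here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(x, W1, W2):
    """ATT(x): produces a heatmap in [0,1]^m over an input vector x."""
    return sigmoid(W2 @ np.tanh(W1 @ x))

def ffnn(x, W, b):
    """FF(x): a single fully-connected layer with tanh activation."""
    return np.tanh(W @ x + b)

rng = np.random.default_rng(1)
x = rng.normal(size=5)                       # input of dimension d = 5
a = attention(x, rng.normal(size=(6, 5)),    # hidden width 6 (assumed)
                 rng.normal(size=(3, 6)))    # output heatmap dim m = 3
assert a.shape == (3,) and np.all((a >= 0) & (a <= 1))
y = ffnn(x, rng.normal(size=(4, 5)), rng.normal(size=4))
assert y.shape == (4,)
```

Different attention modules in the architecture hold disjoint copies of `W1` and `W2`.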
3.3 ATPL architecture
In this paper, we mainly focus on an ATPL decoder architecture that decodes a vector representation $v$ into a sequence $w_1 w_2 \cdots w_n$. The architecture is illustrated in Fig. 1.
We note that, if we require the role vectors to be orthogonal to each other, then decoding the filler $f_t$ only needs to unbind the TPR of the not-yet-decoded words, denoted $\mathbf{S}_t$:
$$f_t \;=\; \mathbf{S}_t u_t, \qquad x_t \;=\; \mathrm{onehot}\Big(\arg\max_i \big(W_e^\top f_t\big)_i\Big),$$
where $x_t$ is a one-hot encoding vector of dimension $V$, $V$ is the size of the vocabulary, and $W_e$ is a word embedding matrix, the $i$-th column of which is the embedding vector of the $i$-th word in the vocabulary; the embedding vectors are obtained by the Stanford GloVe algorithm with zero mean [17].
To compute $u_t$ and $\mathbf{S}_t$, ATPL employs two attention modules controlled by $\bar{\mathbf{S}}_t$, which is the TPR of the so-far generated words $w_1 \cdots w_{t-1}$.
On one hand, $\mathbf{S}_t$ is computed as follows:
$$\mathbf{S}_t \;=\; \mathrm{vec}^{-1}\Big(\mathrm{FF}\big(v \odot \mathrm{ATT}([\mathrm{vec}(\bar{\mathbf{S}}_t);\, h_t])\big)\Big),$$
where $\odot$ is the point-wise multiplication, $[\cdot\,;\cdot]$ concatenates two vectors, and $\mathrm{vec}(\cdot)$ vectorizes a matrix (with $\mathrm{vec}^{-1}$ its inverse). In this construction, $h_t$ is the hidden state of an external LSTM, which we will explain later.
The key idea here is that we employ an attention model to put weights on each dimension of the image feature vector $v$, so that it can be used to compute $\mathbf{S}_t$. Note that it has been demonstrated that attention structures can be used to effectively learn any function [18]. Our work adopts a similar idea to compute $\mathbf{S}_t$ from $\bar{\mathbf{S}}_t$ and $h_t$.
On the other hand, $u_t$ is computed similarly:
$$u_t \;=\; \frac{H\, \mathrm{ATT}([\mathrm{vec}(\bar{\mathbf{S}}_t);\, h_t])}{\big\lVert H\, \mathrm{ATT}([\mathrm{vec}(\bar{\mathbf{S}}_t);\, h_t]) \big\rVert},$$
where $H$ is a constant normalized Hadamard matrix.
In doing so, ATPL can decode an image feature vector $v$ by recursively 1) computing $u_t$ and $\mathbf{S}_t$ from $\bar{\mathbf{S}}_t$ and $h_t$, 2) computing $f_t$ as $\mathbf{S}_t u_t$, and 3) emitting the token $x_t$ and updating $\bar{\mathbf{S}}_{t+1}$. This procedure continues until the full sentence is generated.
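The decoding loop above can be sketched schematically. The two functions below stand in for the trained attention modules (here they are random placeholders, not the networks from the paper), and the role played by the external LSTM state is fixed for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
d_f, d_r, V, T = 8, 4, 10, 4              # illustrative dimensions
W_e = rng.normal(size=(d_f, V))           # word embedding matrix

def compute_S(S_bar, h):                  # placeholder for the S_t module
    return rng.normal(size=(d_f, d_r))

def compute_u(S_bar, h):                  # placeholder for the u_t module
    u = rng.normal(size=d_r)
    return u / np.linalg.norm(u)          # unit-norm unbinding vector

S_bar, h = np.zeros((d_f, d_r)), np.zeros(16)
words = []
for t in range(T):
    S_t = compute_S(S_bar, h)             # TPR of the undecoded words
    u_t = compute_u(S_bar, h)             # unbinding vector for step t
    f_t = S_t @ u_t                       # unbind the filler (Eq. 2)
    w_t = int(np.argmax(W_e.T @ f_t))     # nearest vocabulary word
    words.append(w_t)
    # Bind the generated word into the TPR of the prefix for the next step.
    S_bar = S_bar + np.outer(W_e[:, w_t], u_t)
assert len(words) == T
```

The control flow (compute $\mathbf{S}_t$ and $u_t$, unbind $f_t$, emit a token, update the prefix TPR) mirrors steps 1) to 3) above.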
4 Image Captioning
To showcase our ATPL architecture, we first study its application to the image captioning task. Given an input image, a standard encoder can be employed to convert the image into an image feature vector $v$, and the ATPL decoder then converts $v$ into a sentence. The overall architecture is depicted in Fig. 2.
We evaluate our approach against several baselines on the COCO dataset [20]. The COCO dataset contains 123,287 images, each of which is annotated with at least 5 captions. We use the same pre-defined splits as [8, 19]: 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. We use the same vocabulary as that employed in [19], which consists of 8,791 words.
We use a ResNet [21] pretrained on the ImageNet dataset to extract image features; the image feature vector $v$ has 2048 dimensions. The model is implemented in TensorFlow [22] with the default settings for random initialization and optimization by backpropagation. In our ATPL architecture, the dimensions of the filler and unbinding vectors and the size of the LSTM hidden state are treated as hyperparameters. The vocabulary size is $V = 8{,}791$. ATPL uses semantic tags as in [19].
For comparison, we use the state-of-the-art CNN-LSTM and SCN-LSTM [19] as baselines. The main evaluation results on the COCO dataset are reported in Table 1. The widely used BLEU [23], METEOR [24], and CIDEr [25] metrics are reported in our quantitative evaluation of the proposed scheme.
We can observe that our ATPL architecture significantly outperforms all other baseline approaches across all metrics considered. The results clearly attest to the effectiveness of the ATPL architecture. We attribute the performance gain of ATPL to the use of TPR in place of a pure LSTM decoder, which allows the decoder to learn not only how to generate the filler sequence but also how to generate the role sequence, so that the decoder can better capture the grammar of the language. Indeed, when manually inspecting the generated captions from ATPL, we found none with grammatical mistakes. We attribute this to the fact that our TPR structure enables training to be more effective and more efficient at learning structure through the role vectors.
Note that the focus of this paper is on developing a Tensor Product Representation (TPR) inspired network to replace the core layers in an LSTM; it is therefore directly comparable to an LSTM baseline. Hence, in the experiments, we focus on comparison with a strong CNN-LSTM baseline. We acknowledge that more recent papers have reported better performance on the task of image captioning. Performance improvements in these more recent models are mainly due to using better image features, such as those obtained by Region-based Convolutional Neural Networks (R-CNN), using reinforcement learning (RL) to directly optimize metrics such as CIDEr and to provide a better context vector for caption generation, or using an ensemble of multiple LSTMs, among others. However, the LSTM still plays a core role in these works, and we believe improvement over the core LSTM, in both performance and interpretability, is still very valuable. Deploying these new features and architectures (R-CNN, RL, and ensembles) with ATPL is left for future work.
5 POS Tagging
In this section, we study the application of ATPL to the POS tagging task. Given a sentence $w_1 \cdots w_n$, POS tagging assigns a POS tag, denoted $p_t$, to each token $w_t$. In the following, we first present our ATPL-based model for POS tagging, and then evaluate its performance.
5.1 ATPL POS tagging architecture
Based on TPR theory, the role vector (as well as its dual unbinding vector) contains the POS tag information of each word. Hence, we first use ATPL to compute a sequence of unbinding vectors $u_1, \dots, u_n$ of the same length as the input sentence. We then feed these unbinding vectors into a bidirectional LSTM model to produce a sequence of POS tags.
Our training procedure consists of two steps. In the first step, we employ an unsupervised learning approach to learn how to compute the unbinding vectors. Fig. 3 shows a sequence-to-sequence structure that uses an LSTM as the encoder and ATPL as the decoder; during the training phase of Fig. 3, the input is a sentence and the expected output is the same sentence. We then use the trained system in Fig. 3 to produce the unbinding vectors $u_1, \dots, u_n$ for a given input sentence.
In the second step, we employ a bidirectional LSTM (B-LSTM) module to convert the sequence of unbinding vectors into a sequence of forward and backward hidden states $(\overrightarrow{h}_t, \overleftarrow{h}_t)$. We then compute a vector $p_t$ from each pair, which gives the POS tag at position $t$. This procedure is illustrated in Figure 4.
The first step follows ATPL and is straightforward. Below, we focus on explaining the second step. In particular, given the input sequence $u_1, \dots, u_n$, we can compute the hidden states as
$$\overrightarrow{h}_t = \mathrm{LSTM}(\overrightarrow{h}_{t-1}, u_t), \qquad \overleftarrow{h}_t = \mathrm{LSTM}(\overleftarrow{h}_{t+1}, u_t).$$
Then, the POS tag embedding $p_t$ is computed as
$$p_t \;=\; W_f\, \mathrm{diag}(\alpha_t)\, \overrightarrow{h}_t + W_b\, \mathrm{diag}(\beta_t)\, \overleftarrow{h}_t.$$
Here $\alpha_t$ is computed as follows:
$$\alpha_t \;=\; \mathrm{ATT}(\overrightarrow{h}_t),$$
where $\mathrm{diag}(\cdot)$ constructs a diagonal matrix from the input vector, and $W_f$ and $W_b$ are matrices of appropriate dimensions. $\beta_t$ is defined in the same manner as $\alpha_t$, though with a different set of parameters.
Note that $p_t$ is of dimension equal to the total number of POS tags. Clearly, this model can be trained end-to-end by minimizing a cross-entropy loss.
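The per-position readout can be sketched as follows; all parameters here are random stand-ins for the trained $W_f$, $W_b$ and attention maps, and the dimensions are illustrative assumptions:

```python
import numpy as np

# Sketch of the POS readout: combine forward/backward BLSTM states with
# per-dimension attention weights (diag(alpha), diag(beta)) and project
# into the tag space.
rng = np.random.default_rng(3)
d_h, n_tags = 6, 45                        # hidden size, number of POS tags

h_fwd, h_bwd = rng.normal(size=d_h), rng.normal(size=d_h)
alpha, beta = rng.uniform(size=d_h), rng.uniform(size=d_h)   # heatmaps
W_f = rng.normal(size=(n_tags, d_h))
W_b = rng.normal(size=(n_tags, d_h))

# p_t = W_f diag(alpha) h_fwd + W_b diag(beta) h_bwd
# (diag(a) @ h is just elementwise a * h)
p = W_f @ (alpha * h_fwd) + W_b @ (beta * h_bwd)
assert p.shape == (n_tags,)
```

The predicted tag at a position is then the argmax over the `n_tags` entries of `p`, trained with a cross-entropy loss.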
To evaluate the effectiveness of our model, we test it on the Penn TreeBank dataset [26]. In particular, we first train the sequence-to-sequence model in Fig. 3 using the sentences of Wall Street Journal (WSJ) Section 0 through Section 21 and Section 24 of the Penn TreeBank. Afterwards, we use the same data to train the B-LSTM module in Figure 4.
Table 2: POS tagging accuracy of the baseline and our POS tagger on WSJ Section 22 and Section 23.
Once the model is trained, we test it on WSJ Sections 22 and 23, respectively. We compare the accuracy of our approach against the state-of-the-art Stanford parser [13]. The results are presented in Table 2. From the table, we observe that our approach outperforms the baseline. This confirms our hypothesis that the unbinding vectors, trained in an unsupervised manner, indeed capture grammatical information and can thus be used to effectively predict grammatical structures such as POS tags.
6 Constituency Parsing
In this section, we briefly review the constituency parsing task, and then present our approach, which contains three components: a segmenter, a classifier, and a creator of the parse tree. In the end, we compare our approach against the state-of-the-art approach in [16].
6.1 A brief review of constituency parsing
Constituency parsing converts a natural language sentence into its parse tree. Fig. 5 provides an example of a parse tree on top of its corresponding sentence. From the tree, we can organize the nodes into layers, with the first layer (Layer 0) consisting of all tokens of the original sentence. Layer $\ell$ contains all internal nodes whose depth with respect to the closest leaf that they can reach is $\ell$.
In particular, Layer 1 consists of the POS tags associated with each token. In higher layers, each node corresponds to a substring, i.e., a consecutive subsequence, of the sentence. Each node corresponds to a grammatical structure, such as a single word, a phrase, or a clause, and is associated with a category. For example, in Penn TreeBank, there are over 70 types of categories, including (1) clause-level tags such as S (simple declarative clause); (2) phrase-level tags such as NP (noun phrase) and VP (verb phrase); (3) word-level tags such as NNP (proper noun, singular), VBD (verb, past tense), DT (determiner), and NN (noun, singular or mass); (4) punctuation marks; and (5) special symbols such as $.
The task of constituency parsing recovers both the tree structure and the category associated with each node. In our approach, which employs ATPL to construct the parse tree, we use an encoding $b$ to represent the tree structure. Our approach first generates this encoding from the raw sentence, layer by layer, and then predicts a category for each internal node. In the end, an algorithm converts the encoding with the categories into the full parse tree. In the following, we present these three sub-routines.
6.2 Segmenting a sentence into a tree-encoding
We first introduce the concept of the encoding $b$. For each layer $\ell$, we assign a value $b^{(\ell)}_t$ to each location $t$ of the input sentence. In the first layer, $b^{(1)}_t$ simply encodes the POS tag of input token $w_t$. In a higher layer, $b^{(\ell)}_t$ is either 0 or 1. Thus $b^{(\ell)}_1, \dots, b^{(\ell)}_n$ forms a sequence of alternating sub-sequences of consecutive 0s and consecutive 1s. Each longest run of consecutive 0s or consecutive 1s indicates one internal node at layer $\ell$, and its consecutive positions form the substring of that node. For example, Fig. 5 shows the encodings of the second and third layers of the example sentence.
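The mapping from a 0/1 layer encoding to its internal nodes is simple run-length grouping; the following helper (a small illustration, not part of the model) makes it concrete:

```python
# Each maximal run of equal values in a layer's 0/1 encoding is one
# internal node; the run's positions form the node's substring.
def runs(b):
    """Return [(start, end)] index ranges of maximal constant runs in b."""
    segments, start = [], 0
    for i in range(1, len(b) + 1):
        if i == len(b) or b[i] != b[start]:
            segments.append((start, i - 1))
            start = i
    return segments

# Five tokens grouped into three nodes at a hypothetical layer:
assert runs([0, 0, 1, 1, 0]) == [(0, 1), (2, 3), (4, 4)]
```

Note that only run boundaries matter: flipping every bit of the encoding yields exactly the same segmentation.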
The first component of our ATPL-based parser predicts $b^{(\ell)}$ layer by layer. Note that the first layer is simply the POS tags, so we do not repeat its construction. In the following, we first explain how to construct the second layer's encoding $b^{(2)}$, and then show how this can be extended to construct a higher layer's encoding $b^{(\ell)}$ for $\ell > 2$.
Constructing the second layer $b^{(2)}$.
We can view $b^{(2)}_t$ as a special tag over the POS tag sequence, and thus the same approach used to compute the POS tags can be adapted here to compute $b^{(2)}_t$. This model is illustrated in Fig. 6.
In particular, we can compute the hidden states from the unbinding vectors of the raw sentence as before:
$$\overrightarrow{h}_t = \mathrm{LSTM}(\overrightarrow{h}_{t-1}, u_t), \qquad \overleftarrow{h}_t = \mathrm{LSTM}(\overleftarrow{h}_{t+1}, u_t),$$
and the output of the attention-based B-LSTM is given as
$$b^{(2)}_t \;=\; W_f\, \mathrm{diag}(\alpha_t)\, \overrightarrow{h}_t + W_b\, \mathrm{diag}(\beta_t)\, \overleftarrow{h}_t,$$
where $\alpha_t$ and $\beta_t$ are defined in the same manner as in (10).
Table 3: Constituency parsing results on WSJ Section 22 and Section 23 for [16], our parser, and our parser with the ground-truth encoding $b$.
Constructing a higher layer's encoding $b^{(\ell)}$.
Now we move to higher layers. For a layer $\ell > 2$, to predict $b^{(\ell)}$, our model takes as input both the POS tags and the $(\ell-1)$-th layer's encoding $b^{(\ell-1)}$. The high-level architecture is illustrated in Fig. 7.
Compared with the construction of $b^{(2)}$, the key difference is how to compute the input embedding $g^{(\ell)}_t$. Intuitively, $g^{(\ell)}_t$ is an embedding vector corresponding to the node of Layer $\ell-1$ whose substring contains token $w_t$. Assume word $w_t$ is in the $k$-th substring of Layer $\ell-1$, whose position set is denoted by $I^{(\ell-1)}_k$. Then the embedding can be computed as follows:
$$g^{(\ell)}_t \;=\; \frac{1}{|I^{(\ell-1)}_k|} \sum_{i \in I^{(\ell-1)}_k} \Big( W_f\, \mathrm{diag}(\alpha_i)\, \overrightarrow{h}_i + W_b\, \mathrm{diag}(\beta_i)\, \overleftarrow{h}_i \Big). \qquad (13)$$
Here, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the hidden states of the B-LSTM running over the unbinding vectors as before, and $\alpha_i$ and $\beta_i$ are defined in a similar fashion as in (10). We use $|\cdot|$ to indicate the cardinality of a set.
The most interesting part is that $g^{(\ell)}_t$ aggregates all embeddings computed from the substring of the previous layer $\ell-1$. Note that the set of indexes $I^{(\ell-1)}_k$ can be computed easily from $b^{(\ell-1)}$, and that many different aggregation functions could be used; in (13), we choose the average function. The process of this calculation is illustrated in Fig. 8.
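The averaging step can be sketched as follows; `per_pos` is a stand-in for the per-position attention-weighted B-LSTM vectors, and the substring index ranges are assumed to come from the previous layer's encoding:

```python
import numpy as np

# Sketch of the aggregation in Eq. (13): the embedding fed to layer l for
# token t is the average of per-position vectors over the substring of
# layer l-1 that contains t.
def aggregate(per_pos, substrings):
    g = np.empty_like(per_pos)
    for (s, e) in substrings:                  # one node per substring
        mean = per_pos[s:e + 1].mean(axis=0)   # average over the substring
        g[s:e + 1] = mean                      # same g_t for all t in it
    return g

per_pos = np.array([[1.0], [3.0], [5.0]])      # 3 positions, 1-dim vectors
g = aggregate(per_pos, [(0, 1), (2, 2)])       # two nodes: {0,1} and {2}
assert np.allclose(g, [[2.0], [2.0], [5.0]])
```

All tokens belonging to the same node thus receive an identical embedding, which is what lets the next layer reason about whole constituents rather than individual positions.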
6.3 Classification of Substrings
Once the tree structure is computed, we attach a category to each internal node. We employ an approach similar to the prediction of $b^{(\ell)}$ to predict this category. Note that, at this point, the encoding of the internal node has already been computed. Thus, instead of using the encoding from the previous layer, we use the encoding of the current layer to predict the category directly. This procedure is illustrated in Fig. 9.
Similar to (13), the category embedding of a node is computed by averaging attention-weighted B-LSTM states over the node's own substring. Here, we slightly overload the variable names; we emphasize that the parameters and the hidden states are independent of the ones used in (14).
6.4 Creation of a Parse Tree
6.5 Evaluation

We now evaluate our constituency parsing approach against the state-of-the-art approach [16] using the WSJ dataset of the Penn TreeBank. Similar to our setup for POS tagging, we train our model using WSJ Section 0 through Section 21 and Section 24, and evaluate it on Sections 22 and 23.
Table 3 shows the performance of both [16] and our proposed approach. In addition, we also evaluate our approach assuming the tree-structure encoding $b$ is known; in doing so, we can evaluate the performance of the classification module of our parser. Note that the POS tags are not provided.
We observe that the F-1 measure of our approach is 2 points worse than that of [16]; however, when the ground truth of $b$ is provided, the F-1 measure is 4 points higher than [16], which is significant. Therefore, we attribute our approach's underperformance to the fact that our model may not be effective enough at learning to predict the tree encoding $b$.
We view the use of unbinding vectors as the main novelty of our work. In contrast, all other parsers take the words as input directly. Our ATPL separates the grammatical components of a sentence from its lexical units, so that an author's grammatical style can be characterized by the unbinding vectors while the author's word usage pattern is characterized by the lexical units. Hence, our parser can aid in learning the writing style of an author, since the regularities embedded in the unbinding vectors and the obtained parse trees characterize that style.
7 Conclusion

In this paper, we proposed a new ATPL approach for natural language generation and related tasks. The model has a novel architecture based on a rationale derived from the use of Tensor Product Representations for encoding and processing symbolic structure through neural network computation. In evaluation, we tested the proposed model on image captioning. Compared to widely adopted LSTM-based models, the proposed ATPL gives significant improvements on all major metrics, including METEOR, BLEU, and CIDEr. Moreover, we observe that the unbinding vectors contain important grammatical information, which allows us to design an effective POS tagger and constituency parser with unbinding vectors as input. Our findings in this paper show great promise of TPRs. In the future, we will explore extending TPR to a variety of other NLP tasks.
References

[1] K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” arXiv preprint arXiv:1503.00075, 2015.
[2] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher, “Ask me anything: Dynamic memory networks for natural language processing,” in International Conference on Machine Learning, 2016, pp. 1378–1387.
[3] L. Kong, C. Alberti, D. Andor, I. Bogatyy, and D. Weiss, “DRAGNN: A transition-based framework for dynamically connected neural networks,” arXiv preprint arXiv:1703.04474, 2017.
[4] P. Smolensky, “Tensor product variable binding and the representation of symbolic structures in connectionist systems,” Artificial Intelligence, vol. 46, no. 1-2, pp. 159–216, 1990.
[5] P. Smolensky and G. Legendre, The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar. Volume 1: Cognitive Architecture. MIT Press, 2006.
[6] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-RNN),” in Proceedings of International Conference on Learning Representations, 2015.
[7] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[8] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[9] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Deep compositional question answering with neural module networks,” arXiv preprint arXiv:1511.02799, vol. 2, 2015.
[10] D. Yogatama, P. Blunsom, C. Dyer, E. Grefenstette, and W. Ling, “Learning to compose words into sentences with reinforcement learning,” arXiv preprint arXiv:1611.09100, 2016.
[11] J. Maillard, S. Clark, and D. Yogatama, “Jointly learning sentence embeddings and syntax with unsupervised tree-LSTMs,” arXiv preprint arXiv:1705.09189, 2017.
[12] D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd ed., 2017.
[13] C. Manning, “Stanford parser,” https://nlp.stanford.edu/software/lex-parser.shtml, 2017.
[14] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, “Feature-rich part-of-speech tagging with a cyclic dependency network,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1. Association for Computational Linguistics, 2003, pp. 173–180.
[15] M. Zhu, Y. Zhang, W. Chen, M. Zhang, and J. Zhu, “Fast and accurate shift-reduce constituent parsing,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2013, pp. 434–443.
[16] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, “Grammar as a foreign language,” in Advances in Neural Information Processing Systems, 2015, pp. 2773–2781.
[17] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” https://nlp.stanford.edu/projects/glove/, 2017.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
[19] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[20] COCO, “COCO dataset for image captioning,” http://mscoco.org/dataset/#download, 2017.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[22] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[23] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
[24] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, 2005, pp. 65–72.
[25] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[26] M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor, “Penn Treebank,” https://catalog.ldc.upenn.edu/ldc99t42, 2017.