Attentive Tensor Product Learning for Language Generation and Grammar Parsing

02/20/2018 · Qiuyuan Huang et al.

This paper proposes a new architecture, Attentive Tensor Product Learning (ATPL), to represent grammatical structures in deep learning models. ATPL exploits Tensor Product Representations (TPR), a structured neural-symbolic model developed in cognitive science, to integrate deep learning with explicit language structures and rules. The key ideas of ATPL are: 1) unsupervised learning of role-unbinding vectors of words via a TPR-based deep neural network; 2) the use of attention modules to compute TPRs; and 3) the integration of TPR with typical deep learning architectures, including Long Short-Term Memory (LSTM) networks and Feed-Forward Neural Networks (FFNN). The novelty of our approach lies in its ability to extract the grammatical structure of a sentence via role-unbinding vectors, which are obtained in an unsupervised manner. ATPL is applied to 1) image captioning, 2) part-of-speech (POS) tagging, and 3) constituency parsing of a sentence. Experimental results demonstrate the effectiveness of the proposed approach.



1 Introduction

Deep learning (DL) is an important tool in many natural language processing (NLP) applications. Since natural languages are rich in grammatical structures, there is an increasing interest in learning a vector representation to capture the grammatical structures of natural language descriptions using deep learning models [1, 2, 3].

In this work, we propose a new architecture, called Attentive Tensor Product Learning (ATPL), to address this representation problem by exploiting Tensor Product Representations (TPR) [4, 5]. TPR is a structured neural-symbolic model developed in cognitive science over 20 years ago. In the TPR theory, a sentence can be considered as a sequence of roles (i.e., grammatical components), each filled with a filler (i.e., a token). Given that each role is associated with a role vector $\mathbf{r}_t$ and each filler with a filler vector $\mathbf{f}_t$, the TPR of a sentence can be computed as $\mathbf{S} = \sum_t \mathbf{f}_t \mathbf{r}_t^\top$. Compared with the popular RNN-based representations of a sentence, a useful property of TPR is that the token at timestep $t$ can be decoded directly by providing an unbinding vector $\mathbf{u}_t$; that is, $\mathbf{f}_t = \mathbf{S}\mathbf{u}_t$. Under the TPR theory, encoding and decoding a sentence is equivalent to learning the role vectors $\mathbf{r}_t$ or unbinding vectors $\mathbf{u}_t$ at each position $t$.

We employ the TPR theory to develop a novel attention-based neural network architecture for learning the unbinding vectors $\mathbf{u}_t$, which form the core of ATPL. That is, ATPL employs a form of recurrent neural network to produce the unbinding vectors one at a time. At each step, the TPR of the partial prefix of the sentence up to time $t-1$ is leveraged to compute attention maps, which are then used to compute the TPR as well as the unbinding vector at time $t$. In doing so, ATPL can not only be used to generate a sequence of tokens, but also a sequence of roles, which can interpret the syntactic/semantic structure of the sentence.

To demonstrate the effectiveness of our ATPL architecture, we apply it to three important NLP tasks: 1) image captioning; 2) POS tagging; and 3) constituency parsing of a sentence. The first showcases our ATPL-based generator, while the latter two demonstrate the power of role vectors in interpreting sentences' syntactic structures. Our evaluation shows that on both image captioning and POS tagging, our approach outperforms previous state-of-the-art approaches. In particular, on the constituency parsing task, when the structural segmentation is given as ground truth, our ATPL approach beats the state-of-the-art by 3.5 to 4.4 F-1 points on the Penn TreeBank dataset. These results demonstrate that ATPL is more effective at capturing the syntactic structure of natural language sentences.

The paper is organized as follows. Section 2 discusses related work. In Section 3, we present the design of ATPL. Section 4 through Section 6 describe three applications of ATPL, i.e., image captioner, POS tagger, and constituency parser, respectively. Section 7 concludes the paper.

2 Related work

Our proposed image captioning system follows a great deal of recent caption-generation literature in exploiting end-to-end deep learning with a CNN image-analysis front end producing a distributed representation that is then used to drive a natural-language generation process, typically using RNNs [6, 7, 8]. Our grammatical interpretation of the structural roles of words in sentences makes contact with other work that incorporates deep learning into grammatically-structured networks [1, 9, 10, 11]. Here, however, the network is not itself structured to match the grammatical structure of the sentences being processed; the network structure is fixed, but it is designed to support the learning of distributed representations that incorporate structure internal to the representations themselves, namely filler/role structure.

The second task we consider is POS tagging. Methods for automatic POS tagging include unigram tagging, bigram tagging, tagging using Hidden Markov Models (which are generative sequence models), maximum entropy Markov models (which are discriminative sequence models), rule-based tagging, and tagging using bidirectional maximum entropy Markov models [12]. The celebrated Stanford POS tagger of [13] uses a bidirectional version of the maximum entropy Markov model, called a cyclic dependency network in [14].

Methods for automatic constituency parsing of a sentence, our third task, include methods based on probabilistic context-free grammars (CFGs) [12], the shift-reduce method [15], and sequence-to-sequence LSTMs [16]. Our constituency parser is similar to the sequence-to-sequence LSTM approach of [16] in that both use LSTM networks to build a constituency parser. Different from [16], our parser uses TPR and unbinding role vectors to extract features that contain grammatical information.

Figure 1: ATPL Architecture.

3 Attentive Tensor Product Learning

In this section, we present the ATPL architecture. We will first briefly revisit the Tensor Product Representation (TPR) theory, and then introduce several building blocks. In the end, we explain the ATPL architecture, which is illustrated in Figure 1.

3.1 Background: Tensor Product Representation

The TPR theory allows computing a vector representation of a sentence as a summation over its individual tokens while taking the order of the tokens into account. For a sentence of $T$ words, denoted by $w_1, \dots, w_T$, TPR theory considers the sentence as a sequence of grammatical role slots, with the slot at position $t$ filled with the concrete token $w_t$. The role slot is thus referred to as a role, while the token is referred to as a filler.

The TPR of the sentence can thus be computed by binding each role with its filler. Mathematically, each role is associated with a role vector $\mathbf{r}_t \in \mathbb{R}^q$, and each filler with a filler vector $\mathbf{f}_t \in \mathbb{R}^d$. Then the TPR of the sentence is

$\mathbf{S} = \sum_{t=1}^{T} \mathbf{f}_t \mathbf{r}_t^\top \qquad (1)$

where $\mathbf{S} \in \mathbb{R}^{d \times q}$. Each role is also associated with a dual unbinding vector $\mathbf{u}_t$ such that $\mathbf{r}_t^\top \mathbf{u}_t = 1$ and $\mathbf{r}_{t'}^\top \mathbf{u}_t = 0$ for $t' \neq t$; then

$\mathbf{f}_t = \mathbf{S}\,\mathbf{u}_t. \qquad (2)$

Intuitively, Eq. (2) requires that $R^\top U = I$, where $R = [\mathbf{r}_1, \dots, \mathbf{r}_T]$, $U = [\mathbf{u}_1, \dots, \mathbf{u}_T]$, and $I$ is an identity matrix. In the simplified case where the role vectors are orthogonal to each other and of unit norm, we can easily derive $\mathbf{u}_t = \mathbf{r}_t$.

Eqs. (1) and (2) provide the means to bind and unbind a TPR. Through these mechanisms, it is also easy to construct an encoder and a decoder to convert between a sentence and its TPR. All we need to compute is the role vector $\mathbf{r}_t$ (or its dual unbinding vector $\mathbf{u}_t$) at each timestep $t$. One simple approach is to compute it as the hidden state of a recurrent neural network (e.g., an LSTM). However, this simple strategy may not yield the best performance.
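To make binding and unbinding concrete, here is a minimal NumPy sketch that builds the TPR of a short token sequence with random orthonormal role vectors and recovers every filler by unbinding. The dimensions and the random fillers are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, q, T = 8, 4, 3                                   # filler dim, role dim, sentence length (illustrative)
fillers = rng.normal(size=(T, d))                   # one filler vector per token
roles = np.linalg.qr(rng.normal(size=(q, T)))[0].T  # T orthonormal role vectors, shape (T, q)

# Binding, Eq. (1): S = sum_t f_t r_t^T
S = sum(np.outer(fillers[t], roles[t]) for t in range(T))

# With orthonormal roles, the unbinding vector equals the role vector (u_t = r_t),
# so unbinding, Eq. (2), recovers each filler exactly: f_t = S u_t.
for t in range(T):
    assert np.allclose(S @ roles[t], fillers[t])
print("all fillers recovered by unbinding")
```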

3.2 Building blocks

Before we start introducing ATPL, we first introduce several building blocks repeatedly used in our construction.

An attention module $\mathrm{Att}_{W,b}(\cdot)$ over an input vector $x$ is defined as

$\mathrm{Att}_{W,b}(x) = \sigma(Wx + b) \qquad (3)$

where $\sigma$ is the sigmoid function, $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$, $n$ is the dimension of $x$, and $m$ is the dimension of the output. Intuitively, $\mathrm{Att}_{W,b}(x)$ outputs a vector in $[0,1]^m$ as the attention heatmap; $m$ is equal to the dimension of the quantity that the heatmap will be attended to. $W$ and $b$ are two sets of parameters. Unless noted otherwise, the parameter sets of different attention modules are disjoint from each other.

We refer to a Feed-Forward Neural Network (FFNN) module as a single fully-connected layer:

$\mathrm{FFNN}_{W,b}(x) = \tanh(Wx + b) \qquad (4)$

where $W$ and $b$ are the parameter matrix and the parameter vector of appropriate dimensions, respectively, and $\tanh$ is the hyperbolic tangent function.
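The two building blocks can be sketched as follows, under our reading of Eqs. (3) and (4) as a single affine map followed by a sigmoid or tanh; the class names and dimensions are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Att:
    """Attention module, Eq. (3) as reconstructed: Att_{W,b}(x) = sigmoid(W x + b)."""
    def __init__(self, n, m, rng):
        self.W = rng.normal(scale=0.1, size=(m, n))
        self.b = np.zeros(m)

    def __call__(self, x):
        return sigmoid(self.W @ x + self.b)   # heatmap in [0, 1]^m

class FFNN:
    """Feed-forward module, Eq. (4): a single fully-connected layer with tanh."""
    def __init__(self, n, m, rng):
        self.W = rng.normal(scale=0.1, size=(m, n))
        self.b = np.zeros(m)

    def __call__(self, x):
        return np.tanh(self.W @ x + self.b)

rng = np.random.default_rng(0)
att, ffnn = Att(n=16, m=8, rng=rng), FFNN(n=16, m=8, rng=rng)
x = rng.normal(size=16)
print(att(x).shape, ffnn(x).shape)            # (8,) (8,)
```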

3.3 ATPL architecture

In this paper, we mainly focus on an ATPL decoder architecture that decodes a vector representation $v$ into a sequence of tokens $w_1, \dots, w_T$. The architecture is illustrated in Fig. 1.

We notice that, if we require the role vectors to be orthogonal to each other, then to decode the filler $\mathbf{f}_t$ we only need to unbind the TPR of the not-yet-decoded words, $\mathbf{S}_t$:

$W_e\, x_t = \mathbf{S}_t\, \mathbf{u}_t \qquad (5)$

where $x_t$ is a one-hot encoding vector of dimension $V$ and $V$ is the size of the vocabulary; $W_e$ is a word embedding matrix, the $i$-th column of which is the embedding vector of the $i$-th word in the vocabulary; the embedding vectors are obtained by the Stanford GloVe algorithm with zero mean [17].

To compute $\mathbf{S}_t$ and $\mathbf{u}_t$, ATPL employs two attention modules controlled by $\hat{\mathbf{S}}_t$, the TPR of the words generated so far, $w_1, \dots, w_{t-1}$.

On one hand, $\mathbf{S}_t$ is computed as follows:

(6)
(7)

where $\odot$ is the point-wise multiplication, $[\,\cdot\,;\,\cdot\,]$ concatenates two vectors, and $\mathrm{vec}(\cdot)$ vectorizes a matrix. In this construction, $\mathbf{h}_t$ is the hidden state of an external LSTM, which we will explain later.

The key idea here is that we employ an attention module to put weights on each dimension of the image feature vector $v$, so that it can be used to compute $\mathbf{S}_t$. Note that it has been demonstrated that attention structures can be used to effectively learn any function [18]. Our work adopts a similar idea to compute $\mathbf{S}_t$ from $\hat{\mathbf{S}}_t$ and $v$.

On the other hand, $\mathbf{u}_t$ is computed similarly by the second attention module, using a constant normalized Hadamard matrix $H$.

In doing so, ATPL can decode an image feature vector $v$ by recursively 1) computing $\mathbf{S}_t$ and $\mathbf{u}_t$ from $\hat{\mathbf{S}}_t$, 2) computing the filler $\mathbf{f}_t = \mathbf{S}_t \mathbf{u}_t$, and 3) emitting the word $w_t$ according to Eq. (5) and updating $\hat{\mathbf{S}}_{t+1}$ with the newly generated word. This procedure continues until the full sentence is generated.
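The decoding loop can be summarized schematically as below. Since Eqs. (6) and (7) are not reproduced here, `compute_S` and `compute_u` are hypothetical stand-ins for the attention-based computations described above (the real ones are learned and also condition on the external LSTM state); the nearest-embedding word lookup and the way the prefix TPR is updated with the current unbinding vector are likewise our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, V, T_max = 8, 8, 20, 5                      # illustrative sizes
W_e = rng.normal(size=(d, V))                     # word embedding matrix; column i embeds word i
H = np.linalg.qr(rng.normal(size=(q, q)))[0]      # stand-in for the normalized Hadamard matrix

def compute_S(S_hat, v):
    """Stand-in for Eqs. (6)-(7): attention over the feature vector v and the prefix TPR."""
    return 0.5 * S_hat + 0.1 * np.outer(v, H[:, 0])

def compute_u(S_hat, v):
    """Stand-in for the Hadamard-based attention that yields the unbinding vector."""
    u = H @ (1.0 / (1.0 + np.exp(-rng.normal(size=q))))
    return u / np.linalg.norm(u)

v = rng.normal(size=d)                            # image feature vector
S_hat = np.zeros((d, q))                          # TPR of the words generated so far
words = []
for t in range(T_max):
    S_t = compute_S(S_hat, v)
    u_t = compute_u(S_hat, v)
    f_t = S_t @ u_t                               # unbinding, Eqs. (2)/(5)
    w_t = int(np.argmax(W_e.T @ f_t))             # nearest word by inner product
    words.append(w_t)
    S_hat = S_hat + np.outer(W_e[:, w_t], u_t)    # fold the new word back into the prefix TPR
print(words)
```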

Figure 2: Architecture of image captioning.
Methods METEOR BLEU-1 BLEU-2 BLEU-3 BLEU-4 CIDEr
NIC [7] 0.237 0.666 0.461 0.329 0.246 0.855
CNN-LSTM [19] 0.238 0.698 0.525 0.390 0.292 0.889
SCN-LSTM [19] 0.257 0.728 0.566 0.433 0.330 1.012
ATPL 0.258 0.733 0.572 0.437 0.335 1.013
Table 1: Performance of the proposed ATPL model on the COCO dataset.

Next, we will present three applications of ATPL, i.e., image captioner, POS tagger, and constituency parser in Section 4 through Section 6, respectively.

4 Image Captioning

To showcase our ATPL architecture, we first study its application to the image captioning task. Given an input image, a standard CNN encoder is employed to convert the image into an image feature vector $v$, and the ATPL decoder then converts $v$ into a sentence. The overall architecture is depicted in Fig. 2.

We evaluate our approach with several baselines on the COCO dataset [20]. The COCO dataset contains 123,287 images, each of which is annotated with at least 5 captions. We use the same pre-defined splits as [8, 19]: 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. We use the same vocabulary as that employed in [19], which consists of 8,791 words.

For the CNN of Fig. 1, we used ResNet-152 [21], pretrained on the ImageNet dataset. The image feature vector $v$ has 2048 dimensions. The model is implemented in TensorFlow [22] with the default settings for random initialization and optimization by backpropagation. In our ATPL architecture, the dimensions of the TPR and the unbinding vectors and the size of the LSTM hidden state are fixed hyperparameters, and the vocabulary size is $V = 8{,}791$. ATPL uses the same tags as in [19].

We compare with NIC [7] and the state-of-the-art CNN-LSTM and SCN-LSTM models of [19]. The main evaluation results on the MS COCO dataset are reported in Table 1. We report the widely-used BLEU [23], METEOR [24], and CIDEr [25] metrics in our quantitative evaluation of the proposed scheme.

We can observe that our ATPL architecture outperforms all baseline approaches across all metrics considered. The results attest to the effectiveness of the ATPL architecture. We attribute the performance gain of ATPL to the use of TPR in place of a pure LSTM decoder, which allows the decoder to learn not only how to generate the filler sequence but also how to generate the role sequence, so that the decoder can better capture the grammar of the language. Indeed, a manual inspection of the captions generated by ATPL found none with grammatical mistakes. We attribute this to the fact that the TPR structure enables training to be more effective and more efficient in learning structure through the role vectors.

Note that the focus of this paper is on developing a Tensor Product Representation (TPR) inspired network to replace the core layers in an LSTM; therefore, it is directly comparable to an LSTM baseline, and in the experiments we focus on comparison to a strong CNN-LSTM baseline. We acknowledge that more recent papers report better performance on image captioning. The improvements in those models are mainly due to better image features such as those obtained by Region-based Convolutional Neural Networks (R-CNN), the use of reinforcement learning (RL) to directly optimize metrics such as CIDEr, or the use of an ensemble of multiple LSTMs, among others. However, the LSTM still plays a core role in these works, and we believe that improving the core LSTM, in both performance and interpretability, remains valuable. Combining these newer components (R-CNN features, RL, and ensembling) with ATPL is left as future work.

Figure 3: Architecture for acquisition of unbinding vectors of a sentence.

5 POS Tagging

In this section, we study the application of ATPL to the POS tagging task. Given a sentence $w_1, \dots, w_T$, POS tagging assigns a POS tag to each token $w_t$. In the following, we first present our ATPL-based model for POS tagging and then evaluate its performance.

5.1 ATPL POS tagging architecture

Based on TPR theory, the role vector (as well as its dual unbinding vector) contains the POS tag information of each word. Hence, we first use ATPL to compute a sequence of unbinding vectors $\mathbf{u}_1, \dots, \mathbf{u}_T$ of the same length as the input sentence. Then we feed these unbinding vectors into a bidirectional LSTM model to produce a sequence of POS tags.

Our training procedure consists of two steps. In the first step, we employ an unsupervised learning approach to learn how to compute the unbinding vectors. Fig. 3 shows a sequence-to-sequence structure that uses an LSTM as the encoder and ATPL as the decoder; during the training phase of Fig. 3, the input is a sentence and the expected output is the same sentence. We then use the trained system in Fig. 3 to produce the unbinding vectors $\mathbf{u}_1, \dots, \mathbf{u}_T$ for a given input sentence.

In the second step, we employ a bidirectional LSTM (B-LSTM) module to convert the sequence of unbinding vectors into a sequence of forward and backward hidden states. Then we compute a vector from each pair of hidden states, which encodes the POS tag at position $t$. This procedure is illustrated in Figure 4.

Figure 4: Structure of POS tagger.

The first step follows ATPL directly and is straightforward. Below, we focus on explaining the second step. In particular, given the input sequence of unbinding vectors $\mathbf{u}_1, \dots, \mathbf{u}_T$, we compute the hidden states of the B-LSTM as

(8)

Then, the POS tag embedding is computed as

(9)

Here, the intermediate quantity in (9) is computed as follows:

(10)

where $\mathrm{diag}(\cdot)$ constructs a diagonal matrix from its input vector, and the weight matrices are of appropriate dimensions. The second quantity is defined in the same manner as the first, though with a different set of parameters.

Note that the output vector is of dimension equal to the total number of POS tags. Clearly, this model can be trained end-to-end by minimizing a cross-entropy loss.
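A minimal PyTorch sketch of the second step is given below, assuming the unbinding vectors have already been produced by the trained encoder-decoder of Fig. 3. The classifier head is a single linear layer over the concatenated forward/backward states, which simplifies the parameterization of Eqs. (9)-(10); the dimensions and the tag count of 45 (the usual Penn TreeBank tag set size) are placeholders.

```python
import torch
import torch.nn as nn

class UnbindingPOSTagger(nn.Module):
    """B-LSTM over precomputed unbinding vectors, then a per-position tag classifier
    (a simplified stand-in for the head defined by Eqs. (9)-(10))."""
    def __init__(self, unbind_dim, hidden_dim, num_tags):
        super().__init__()
        self.blstm = nn.LSTM(unbind_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, u_seq):                     # u_seq: (batch, T, unbind_dim)
        h, _ = self.blstm(u_seq)                  # (batch, T, 2 * hidden_dim)
        return self.head(h)                       # (batch, T, num_tags) tag logits

# toy usage with random tensors standing in for real unbinding vectors
tagger = UnbindingPOSTagger(unbind_dim=25, hidden_dim=64, num_tags=45)
u_seq = torch.randn(2, 7, 25)                     # 2 sentences, 7 tokens each
logits = tagger(u_seq)
gold = torch.randint(0, 45, (2, 7))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 45), gold.reshape(-1))
loss.backward()
print(logits.shape)                               # torch.Size([2, 7, 45])
```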

5.2 Evaluation

To evaluate the effectiveness of our model, we test it on the Penn TreeBank dataset [26]. In particular, we first train the sequence-to-sequence model of Fig. 3 using the sentences of Wall Street Journal (WSJ) Sections 0 through 21 and Section 24 of the Penn TreeBank [26]. Afterwards, we use the same data to train the B-LSTM module of Figure 4.


[13] Our POS tagger
WSJ 22 WSJ 23 WSJ 22 WSJ 23
Accuracy 0.972 0.973 0.973 0.974
Table 2: Performance of POS Tagger.

Once the model is trained, we test it on WSJ Sections 22 and 23. We compare the accuracy of our approach against the state-of-the-art Stanford POS tagger [13]. The results are presented in Table 2. From the table, we can observe that our approach outperforms the baseline. This confirms our hypothesis that the unsupervisedly trained unbinding vectors indeed capture grammatical information, which can be used to effectively predict grammatical structure such as POS tags.

Figure 5: The parse tree of a sentence and its layers.

6 Constituency Parsing

In this section, we briefly review the constituency parsing task and then present our approach, which contains three components: a segmenter, a classifier, and a parse-tree creator. In the end, we compare our approach against the state-of-the-art approach of [16].

6.1 A brief review of constituency parsing

Constituency parsing converts a natural language sentence into its parse tree. Fig. 5 provides an example of a parse tree on top of its corresponding sentence. From the tree, we can assign each node to a layer, with the first layer (Layer 0) consisting of all tokens from the original sentence. Layer $l$ contains all internal nodes whose depth with respect to the closest leaf they can reach is $l$.

In particular, Layer 1 consists of the POS tags associated with the tokens. In higher layers, each node corresponds to a substring, i.e., a consecutive subsequence, of the sentence. Each node corresponds to a grammatical structure, such as a single word, a phrase, or a clause, and is associated with a category. For example, in the Penn TreeBank, there are over 70 types of categories, including (1) clause-level tags such as S (simple declarative clause), (2) phrase-level tags such as NP (noun phrase) and VP (verb phrase), (3) word-level tags such as NNP (proper noun, singular), VBD (verb, past tense), DT (determiner), and NN (noun, singular or mass), (4) punctuation marks, and (5) special symbols such as $.

The task of constituency parsing recovers both the tree structure and the category associated with each node. In our ATPL-based approach to constructing the parse tree, we use an encoding $z$ to represent the tree structure. Our approach first generates this encoding from the raw sentence, layer by layer, and then predicts a category for each internal node. In the end, an algorithm converts the encoding together with the categories into the full parse tree. In the following, we present these three sub-routines.

6.2 Segmenting a sentence into a tree-encoding

We first introduce the concept of the encoding $z^l$. For each layer $l$, we assign a value $z^l_i$ to each location $i$ of the input sentence. In the first layer, $z^1_i$ simply encodes the POS tag of input token $w_i$. In higher layers, $z^l_i$ is either 0 or 1. Thus $z^l$ forms a sequence with alternating sub-sequences of consecutive 0s and consecutive 1s. Each maximal run of consecutive 0s or consecutive 1s indicates one internal node at layer $l$, and the corresponding consecutive positions form the substring of that node. For example, the second and third layers of Fig. 5 are each encoded as one such 0/1 sequence.
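The mapping from a layer's 0/1 encoding to its internal-node substrings is a simple run-length grouping, sketched below with a made-up encoding (the concrete encodings of Fig. 5 are not reproduced in this copy).

```python
from itertools import groupby

def spans_from_encoding(z):
    """Split a layer encoding z (a sequence of 0s and 1s) into maximal runs of equal values.
    Each run is one internal node; returns (start, end) position pairs with end exclusive."""
    spans, start = [], 0
    for _, run in groupby(z):
        length = sum(1 for _ in run)
        spans.append((start, start + length))
        start += length
    return spans

z_layer = [0, 0, 1, 1, 1, 0]                 # hypothetical encoding over a 6-token sentence
print(spans_from_encoding(z_layer))          # [(0, 2), (2, 5), (5, 6)]
```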

The first component of our ATPL-based parser predicts $z^l$ layer by layer. Note that the first layer is simply the POS tags, so we do not repeat it. In the following, we first explain how to construct the second layer's encoding $z^2$, and then show how the construction can be extended to higher layers' encodings $z^l$ for $l > 2$.

Figure 6: Structure of the segmenter on Layer 2.

Constructing the second layer $z^2$.

We can view $z^2_i$ as a special tag over the POS tag sequence, and thus the same approach used to compute the POS tags can be adapted here to compute $z^2$. This model is illustrated in Fig. 6.

In particular, we can compute the hidden states from the unbinding vectors of the raw sentence as before:

(11)

and the output of the attention-based B-LSTM is given below:

(12)

where the weights are defined in the same manner as in (10).

Figure 7: Structure of the segmenter on Layer .
Figure 8: Segmenting Layer .

[16] Our parser Our parser with ground-truth ($z$)
WSJ 22 WSJ 23 WSJ 22 WSJ 23 WSJ 22 WSJ 23
Precision N/A N/A 0.898 0.910 0.952 0.952
Recall N/A N/A 0.901 0.907 0.973 0.978
F-1 measure 0.928 0.921 0.900 0.908 0.963 0.965
Table 3: Performance of Constituency Parser.

Constructing higher layers' encodings $z^l$.

Now we move to higher layers. For a layer $l > 2$, to predict $z^l$, our model takes both the POS tag input and the $(l-1)$-th layer's encoding $z^{l-1}$. The high-level architecture is illustrated in Fig. 7.

Compared with the second-layer segmenter, the key difference is how to compute the per-position embedding that encodes the previous layer's segmentation. Intuitively, this is an embedding vector corresponding to the Layer-$(l-1)$ node whose substring contains token $w_i$. Assume word $w_i$ is in the $k$-th substring of Layer $l-1$. Then the embedding can be computed as follows:

(13)

Here, the forward and backward hidden states are produced by a B-LSTM running over the unbinding vectors as before, and the weights are defined in a similar fashion to (10). We use $|\cdot|$ to denote the cardinality of a set.

The most interesting part is that (13) aggregates all embeddings computed over the corresponding substring of the previous layer $l-1$. Note that the set of indexes of this substring can be computed easily from $z^{l-1}$. Many different aggregation functions could be used; in (13) we choose the average. The process of this calculation is illustrated in Fig. 8.
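As a concrete illustration of the averaging aggregation in (13), the snippet below (reusing `spans_from_encoding` from the earlier sketch, with random per-position vectors standing in for the B-LSTM-derived embeddings) assigns to each position the mean embedding of the previous-layer substring containing it.

```python
import numpy as np

def segment_averaged_embeddings(position_embeddings, prev_layer_encoding):
    """For every position i, average the embeddings over the previous-layer substring
    that contains i -- the aggregation used in Eq. (13)."""
    out = np.empty_like(position_embeddings)
    for start, end in spans_from_encoding(prev_layer_encoding):
        out[start:end] = position_embeddings[start:end].mean(axis=0)
    return out

emb = np.random.default_rng(0).normal(size=(6, 4))                  # toy per-position embeddings
print(segment_averaged_embeddings(emb, [0, 0, 1, 1, 1, 0]).shape)   # (6, 4)
```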

Figure 9: Structure of the classifier on Layer .

6.3 Classification of Substrings

Once the tree structure is computed, we attach a category to each internal node. We employ an approach similar to the one used for predicting $z^l$ to predict this category. Note that, this time, the encoding of the internal node has already been computed. Thus, instead of using the encoding from the previous layer, we use the encoding of the current layer $z^l$ to predict the category directly. This procedure is illustrated in Fig. 9.

Similar to (13), the category prediction is computed by

(14)

Here, we slightly overload the variable names; we emphasize that the parameters and the hidden states are independent of the ones used in (13).

Note that the main difference between (14) and (13) is that the aggregation operates over the substring at layer $l$ rather than the substring at layer $l-1$. Also, the output here has dimension equal to the total number of categories, while the segmenter's output has dimension 2.

6.4 Creation of a Parse Tree

Once both the tree encoding $z$ and the node categories are constructed, we can create the parse tree from them using a linear-time sub-routine, given as Algorithm 1. For the example in Fig. 5, the output is (S(NNP John)(VP(VBD hit)(NP(DT the)(NN ball)))).

   Input: (; )
   i=0
   for  to  do
       for  to  do
           if  then
               if  then
                    output “(” and
                    push into the stack
                    if  then
                        output and “)”
                        pop out of the stack
                    end if
               else
                    if  then
                        output “(” and
                        push into the stack
                        if  then
                            output and “)”
                            pop out of the stack
                        end if
                    end if
               end if
           else
               if  &  then
                    output “(” and
                    push into the stack
                    if  then
                        output and “)”
                        pop out of the stack
                        if  or  then
                            while the stack is not empty do
                                 pop an element out of the stack
                                 if the substring of the element ends at  then
                                     output “)”
                                 else
                                     push the element back into the stack
                                 end if
                            end while
                        end if
                    end if
               end if
           end if
       end for
   end for
Algorithm 1 Creation of a constituency parse tree
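Because the control conditions of Algorithm 1 are not legible in this copy, the sketch below is not a transcription of it. Instead, it is a compact reconstruction of the same functionality under the assumption that the segmenter and classifier have already produced one labeled, properly nested span per internal node plus a POS tag per token; it emits the bracketed string in the same format as the example output for Fig. 5.

```python
def bracketed_parse(tokens, pos_tags, labeled_spans):
    """tokens: list of words; pos_tags: one POS tag per token;
    labeled_spans: list of (start, end, category) with end exclusive, assumed properly nested.
    Returns a bracketed parse string such as (S(NNP John)(VP ...))."""
    # wider spans (ancestors) must come before the spans they contain
    spans = sorted(labeled_spans, key=lambda s: (s[0], -(s[1] - s[0])))

    def render(start, end, remaining):
        if not remaining:                    # no internal node left: emit POS-tagged leaves
            return "".join(f"({pos_tags[i]} {tokens[i]})" for i in range(start, end))
        (s, e, label), rest = remaining[0], remaining[1:]
        inside = [sp for sp in rest if s <= sp[0] and sp[1] <= e]
        outside = [sp for sp in rest if sp not in inside]
        left = render(start, s, [sp for sp in outside if sp[1] <= s])
        right = render(e, end, [sp for sp in outside if sp[0] >= e])
        return left + "(" + label + render(s, e, inside) + ")" + right

    return render(0, len(tokens), spans)

# the Fig. 5 example, with spans read off its parse tree
tokens = ["John", "hit", "the", "ball"]
pos = ["NNP", "VBD", "DT", "NN"]
spans = [(0, 4, "S"), (1, 4, "VP"), (2, 4, "NP")]
print(bracketed_parse(tokens, pos, spans))
# (S(NNP John)(VP(VBD hit)(NP(DT the)(NN ball))))
```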

6.5 Evaluation

We now evaluate our constituency parsing approach against the state-of-the-art approach of [16] using the WSJ portion of the Penn TreeBank. Similar to our POS tagging setup, we train our model using WSJ Sections 0 through 21 and Section 24, and evaluate it on Sections 22 and 23.

Table 3 shows the performance of both [16] and our proposed approach. In addition, we also evaluate our approach assuming the tree-structure encoding $z$ is given as ground truth; in doing so, we can isolate the performance of the parser's classification module. Note that the POS tags are not provided.

We observe that the F-1 measure of our approach is about 2 points worse than [16]; however, when the ground truth of $z$ is provided, the F-1 measure is about 4 points higher than [16], which is significant. We therefore attribute our approach's underperformance to the segmenter: our model may not yet be effective enough at learning to predict the tree encoding $z$.

Remarks.

We view the use of unbinding vectors as the main novelty of our work; in contrast, other parsers take the words themselves as input. ATPL separates the grammatical components of a sentence from its lexical units, so that an author's grammatical style can be characterized by the unbinding vectors while the author's word-usage patterns can be characterized by the lexical units. Hence, our parser can also aid in learning an author's writing style, since the regularities embedded in the unbinding vectors and the resulting parse trees characterize that style.

7 Conclusion

In this paper, we proposed a new ATPL approach for natural language generation and related tasks. The model has a novel architecture based on a rationale derived from the use of Tensor Product Representations for encoding and processing symbolic structure through neural network computation. In evaluation, we tested the proposed model on image captioning; compared to widely adopted LSTM-based models, the proposed ATPL gives significant improvements on all major metrics, including METEOR, BLEU, and CIDEr. Moreover, we observe that the unbinding vectors contain important grammatical information, which allows us to design an effective POS tagger and constituency parser that take unbinding vectors as input. Our findings show the great promise of TPRs. In the future, we will explore extending TPR to a variety of other NLP tasks.

References