Structural Attention Neural Networks for improved sentiment analysis

01/07/2017 ∙ by Filippos Kokkinos, et al. ∙ National Technical University of Athens

We introduce a tree-structured attention neural network for sentences and small phrases and apply it to the problem of sentiment classification. Our model extends current recursive models by incorporating structural information around a node of a syntactic tree, using both bottom-up and top-down information propagation. The model also uses structural attention to identify the most salient representations during the construction of the syntactic tree. To our knowledge, the proposed models achieve state-of-the-art performance on the Stanford Sentiment Treebank dataset.




1 Introduction

Sentiment analysis deals with the assessment of opinions, speculations, and emotions in text [Zhang et al. 2012, Pang and Lee 2008]. It is a relatively recent research area that has attracted great interest, as demonstrated by a series of shared evaluation tasks, e.g., analysis of tweets [Nakov et al. 2016]. In [Turney and Littman 2002], the affective ratings of unknown words were predicted from the affective ratings of a small set of seed words and the semantic relatedness between the unknown and the seed words. An example of sentence-level analysis was proposed in [Malandrakis et al. 2013]. Other application areas include the detection of public opinion and prediction of election results [Singhal et al. 2015] and the correlation of mood states with stock market indices [Bollen et al. 2011].

Recently, Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber 1997] or Gated Recurrent Units (GRU) [Chung et al. 2014] have been applied to various Natural Language Processing tasks. Tree-structured neural networks, referred to in the literature as Recursive Neural Networks, are of linguistic interest because they closely follow the syntactic structure of sentences and can capture distributed structural information such as logical terms [Socher et al. 2012]. These syntactic structures are N-ary trees that represent either the underlying structure of a sentence (constituency trees) or the relations between words (dependency trees).

This paper focuses on sentence-level sentiment classification of movie reviews, using syntactic parse trees as input for the proposed networks. To solve this task, we build on a variant of Recursive Neural Networks that recursively creates representations following the syntactic structure. The proposed computation model exploits information from the subnodes as well as the parent nodes of the node under examination. This neural network is referred to as a Bidirectional Recursive Network [Irsoy and Cardie 2013]. The model is further enhanced with memory units and the proposed structural attention mechanism. Not all nodes of a tree are equally informative, so the proposed model uses structural attention to selectively weight the contribution of each node to the sentence-level representation.

We evaluate our approach on the sentence-level sentiment classification task using one standard movie review dataset [Socher et al. 2013]. Experimental results show that the proposed model outperforms the state-of-the-art methods.

2 Tree-Structured GRUs

Recursive GRUs on tree structures (Tree-GRU) extend sequential GRUs so that information can propagate through tree-shaped network topologies. As in the Recursive LSTM network on tree structures [Tai et al. 2015], every node of the tree has gating mechanisms that modulate the flow of information inside the unit, but without the need for a separate memory cell. The activation $h_j$ of the Tree-GRU for node $j$ is the interpolation of the previously calculated activations $h_{jk}$ of its $k$-th child, out of $N$ total children, and the candidate activation $\tilde{h}_j$:

$$h_j = z_j \odot \sum_{k=1}^{N} h_{jk} + (1 - z_j) \odot \tilde{h}_j \qquad (1)$$

Here $z_j$ is the update gate, which decides the degree of update that will occur on the activation based on the input vector $x_j$ and the previously calculated representations $h_{jk}$:

$$z_j = \sigma\Big(U_z x_j + \sum_{k=1}^{N} W_z^{k} h_{jk}\Big) \qquad (2)$$

The candidate activation for a node $j$ is computed similarly to that of a Recursive Neural Network as in [Socher et al. 2011]:

$$\tilde{h}_j = f\Big(U_h x_j + \sum_{k=1}^{N} W_h^{k} (h_{jk} \odot r_j)\Big) \qquad (3)$$

where $r_j$ is the reset gate, which allows the network to effectively forget previously computed representations when its value is close to 0. It is computed as follows:

$$r_j = \sigma\Big(U_r x_j + \sum_{k=1}^{N} W_r^{k} h_{jk}\Big) \qquad (4)$$

Every part of a gated recurrent unit satisfies $x_j, h_j, r_j, z_j, \tilde{h}_j \in \mathbb{R}^{d}$, where $d$ is the input vector dimensionality, and $\odot$ denotes elementwise multiplication. $\sigma$ is the sigmoid function and $f$ is the non-linear tanh function. The matrices $W^{k}, U \in \mathbb{R}^{d \times d}$ used in (2)-(4) are the trainable weight parameters, which connect the $k$-th child node representation with the $j$-th node representation and the input vector $x_j$.
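The node update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard gated form (sigmoid update and reset gates, a tanh candidate, and interpolation between the summed child activations and the candidate), omits bias terms, and feeds a zero input vector to the internal node. All function and variable names (`tree_gru_node`, `matvec`, `vsum`) are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vsum(vectors, d):
    """Element-wise sum of a list of d-dimensional vectors (zeros if empty)."""
    return [sum(v[i] for v in vectors) for i in range(d)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tree_gru_node(x, children, U, W):
    """One Tree-GRU step (sketch).

    x        : input vector of the node (length d)
    children : list of child activations (empty for a leaf)
    U        : dict gate -> d x d matrix applied to x, gates 'z', 'r', 'h'
    W        : dict gate -> list of d x d matrices, one per child position
    """
    d = len(x)
    # Update and reset gates: sigmoid(U x + sum_k W^k h_k)
    z = sigmoid(vsum([matvec(U["z"], x)]
                     + [matvec(Wk, h) for Wk, h in zip(W["z"], children)], d))
    r = sigmoid(vsum([matvec(U["r"], x)]
                     + [matvec(Wk, h) for Wk, h in zip(W["r"], children)], d))
    # Candidate: each child state is gated by r before its linear map.
    gated = [[hi * ri for hi, ri in zip(h, r)] for h in children]
    pre = vsum([matvec(U["h"], x)]
               + [matvec(Wk, g) for Wk, g in zip(W["h"], gated)], d)
    cand = [math.tanh(a) for a in pre]
    # Interpolate between the summed child activations and the candidate.
    child_sum = vsum(children, d)
    return [zi * si + (1.0 - zi) * ci for zi, si, ci in zip(z, child_sum, cand)]

# Toy usage with d = 2, two child slots, and 0.5 * identity weights.
half_eye = [[0.5, 0.0], [0.0, 0.5]]
U = {g: half_eye for g in "zrh"}
W = {g: [half_eye, half_eye] for g in "zrh"}
left = tree_gru_node([1.0, -1.0], [], U, W)     # leaf
right = tree_gru_node([0.5, 0.5], [], U, W)     # leaf
parent = tree_gru_node([0.0, 0.0], [left, right], U, W)
```

For a leaf the child sum is zero, so the activation reduces to the candidate scaled by one minus the update gate.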

2.1 Bidirectional TreeGRU

A natural extension of the Tree-Structured GRU is the addition of a bidirectional approach. TreeGRUs calculate an activation for node $j$ using previously computed activations lying lower in the tree structure. The bidirectional approach uses information from both the upper and the lower nodes of the tree for a particular node $j$. In this manner, a newly calculated activation incorporates content from both the children and the parent of a particular node.

The bidirectional neural network is computed in two separate phases: i) the Upward phase and ii) the Downward phase. During the Upward phase, the network topology is similar to that of a TreeGRU: every activation is calculated in a bottom-up fashion from the previously calculated activations found lower in the structure. Once every activation has been computed, from leaves to root, the root activation is used as input to the Downward phase. The Downward phase calculates the activations for every child of a node using content from the parent in a top-down fashion. The two phases are kept separate: in a first pass the network computes the upward activations, and only after this is completed are the downward representations computed.

As in the TreeGRU, the upward activation $h_j^{\uparrow}$ for node $j$ is the interpolation of the previously calculated activations $h_{jk}^{\uparrow}$ of its $k$-th child, out of $N$ total children, and the candidate activation $\tilde{h}_j^{\uparrow}$:

$$h_j^{\uparrow} = z_j^{\uparrow} \odot \sum_{k=1}^{N} h_{jk}^{\uparrow} + (1 - z_j^{\uparrow}) \odot \tilde{h}_j^{\uparrow} \qquad (5)$$

The update gate, reset gate and candidate activation are computed as follows:

$$z_j^{\uparrow} = \sigma\Big(U_z^{\uparrow} x_j + \sum_{k=1}^{N} W_z^{\uparrow k} h_{jk}^{\uparrow}\Big) \qquad (6)$$

$$r_j^{\uparrow} = \sigma\Big(U_r^{\uparrow} x_j + \sum_{k=1}^{N} W_r^{\uparrow k} h_{jk}^{\uparrow}\Big) \qquad (7)$$

$$\tilde{h}_j^{\uparrow} = f\Big(U_h^{\uparrow} x_j + \sum_{k=1}^{N} W_h^{\uparrow k} (h_{jk}^{\uparrow} \odot r_j^{\uparrow})\Big) \qquad (8)$$

Figure 1: A tree-structured bidirectional neural network with Gated Recurrent Units. The input vectors are given to the model in order to generate the upward and downward phrase representations.

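The downward activation $h_j^{\downarrow}$ for node $j$ is the interpolation of the previously calculated activation $h_{p(j)}^{\downarrow}$, where the function $p(j)$ returns the index of the parent node of $j$, and the candidate activation $\tilde{h}_j^{\downarrow}$:

$$h_j^{\downarrow} = z_j^{\downarrow} \odot h_{p(j)}^{\downarrow} + (1 - z_j^{\downarrow}) \odot \tilde{h}_j^{\downarrow} \qquad (9)$$

The update gate, reset gate and candidate activation for the downward phase are computed as follows:

$$z_j^{\downarrow} = \sigma\big(U_z^{\downarrow} h_j^{\uparrow} + W_z^{\downarrow} h_{p(j)}^{\downarrow}\big) \qquad (10)$$

$$r_j^{\downarrow} = \sigma\big(U_r^{\downarrow} h_j^{\uparrow} + W_r^{\downarrow} h_{p(j)}^{\downarrow}\big) \qquad (11)$$

$$\tilde{h}_j^{\downarrow} = f\big(U_h^{\downarrow} h_j^{\uparrow} + W_h^{\downarrow} (h_{p(j)}^{\downarrow} \odot r_j^{\downarrow})\big) \qquad (12)$$

During the downward phase, the matrices $U^{\downarrow} \in \mathbb{R}^{d \times d}$ connect the upward representation $h_j^{\uparrow}$ of node $j$ with the respective downward node, while the matrices $W^{\downarrow} \in \mathbb{R}^{d \times d}$ connect the parent representation $h_{p(j)}^{\downarrow}$.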
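The two-phase computation can be illustrated as a post-order traversal followed by a pre-order traversal. The sketch below abstracts the gated units away as two callables so that only the traversal order is shown. The `Node` class, the function names, and the choice to seed the root's downward step with its own upward activation are assumptions made for illustration; the text only states that the root activation is used as input to the Downward phase.

```python
class Node:
    """A tree node holding an input x plus its upward and downward states."""
    def __init__(self, x, children=()):
        self.x = x
        self.children = list(children)
        self.up = None
        self.down = None

def upward_pass(node, up_cell):
    """Post-order traversal: children are computed before their parent."""
    child_states = [upward_pass(c, up_cell) for c in node.children]
    node.up = up_cell(node.x, child_states)
    return node.up

def downward_pass(node, down_cell, parent_down):
    """Pre-order traversal: a node is computed before its children."""
    node.down = down_cell(node.up, parent_down)
    for child in node.children:
        downward_pass(child, down_cell, node.down)

def run_bidirectional(root, up_cell, down_cell):
    upward_pass(root, up_cell)
    # Assumption: the root has no parent, so its upward state seeds the pass.
    downward_pass(root, down_cell, root.up)
    return root

# Toy usage with scalar "states" so the order of computation is easy to check.
tree = Node(1, [Node(2), Node(3)])
run_bidirectional(
    tree,
    up_cell=lambda x, child_states: x + sum(child_states),
    down_cell=lambda up, parent_down: up + parent_down,
)
```

In a real model `up_cell` and `down_cell` would be the gated units of the Upward and Downward phases; the scalar lambdas here only stand in for them.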

2.2 Structural Attention

We introduce Structural Attention, a generalization of the sequential attention model [Luong et al. 2015] that extracts informative nodes from a syntactic tree and aggregates the representations of those nodes to form the sentence vector. We feed the representation $h_j$ of node $j$ through a one-layer Multilayer Perceptron with weight matrix $W_w \in \mathbb{R}^{d \times d}$ to get the hidden representation $u_j$:

$$u_j = \tanh(W_w h_j + b_w) \qquad (13)$$

Using the softmax function, the weight $a_j$ of each node is obtained from the similarity of the hidden representation $u_j$ and a global context vector $u_w \in \mathbb{R}^{d}$. The normalized weights $a_j$ are used to form the final sentence representation $s \in \mathbb{R}^{d}$, which is a weighted summation of every node representation $h_j$:

$$a_j = \frac{\exp(u_j^{\top} u_w)}{\sum_{i} \exp(u_i^{\top} u_w)} \qquad (14)$$

$$s = \sum_{i} a_i h_i \qquad (15)$$

where the sums run over all nodes $i$ of the tree.

The proposed attention model operates on structural content: because of the recursive nature of the network topology, every node representation contains syntactic structural information.
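As an illustration of the attention pooling described above, the following sketch scores each node against a context vector, normalizes the scores with a softmax, and returns the weighted sum of node representations. It assumes a tanh nonlinearity and a bias term in the one-layer perceptron; the function name and argument names are hypothetical, and the toy inputs are made up.

```python
import math

def structural_attention(node_states, W, b, context):
    """Return (weights, sentence_vector) for a list of node vectors (sketch)."""
    # One-layer perceptron: u_j = tanh(W h_j + b)
    hidden = [
        [math.tanh(sum(w * h for w, h in zip(row, hj)) + bi)
         for row, bi in zip(W, b)]
        for hj in node_states
    ]
    # Similarity of each hidden representation with the global context vector.
    scores = [sum(u * c for u, c in zip(uj, context)) for uj in hidden]
    # Softmax over nodes (shifted by the max score for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Sentence vector: attention-weighted sum of the node representations.
    d = len(node_states[0])
    sentence = [sum(a * h[i] for a, h in zip(weights, node_states))
                for i in range(d)]
    return weights, sentence

# Toy usage: two nodes, identity weights, context aligned with the first node.
states = [[1.0, 0.0], [0.0, 1.0]]
weights, sentence = structural_attention(
    states, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0], context=[1.0, 0.0]
)
```

In this toy case the first node receives the larger weight because its hidden representation is more similar to the context vector.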

3 Experiments

We evaluate the performance of the aforementioned models on the task of sentiment classification of sentences sampled from movie reviews. We use the Stanford Sentiment Treebank [Socher et al. 2013] dataset, which contains sentiment labels for every syntactically plausible phrase of the 8544/1101/2210 train/dev/test sentences. Each phrase is labeled with one of five sentiment classes: very negative, negative, neutral, positive, very positive. The dataset can also be used for a binary classification subtask by excluding any neutral phrases from the original splits. The binary classification subtask is evaluated on 6920/872/1821 train/dev/test splits.
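The derivation of the binary subtask can be sketched as a simple label filter. The text only states that neutral phrases are excluded; merging the two negative classes into one label and the two positive classes into another, as well as the 0 to 4 index order, are assumptions of this sketch. The example phrases are made up.

```python
# Assumed index order for the five classes (0 = very negative ... 4 = very positive).
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

def to_binary(label):
    """Map a 5-class label index to 0 (negative) or 1 (positive); None if neutral."""
    if label == 2:
        return None
    return 0 if label < 2 else 1

def binary_subset(examples):
    """Keep (text, binary_label) pairs, dropping neutral examples."""
    kept = []
    for text, label in examples:
        binary = to_binary(label)
        if binary is not None:
            kept.append((text, binary))
    return kept

# Toy usage with invented phrases.
phrases = [("dull", 0), ("fine", 2), ("a delight", 4), ("good", 3)]
subset = binary_subset(phrases)
```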

3.1 Sentiment Classification

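For all of the aforementioned architectures, at each node $j$ we use a softmax classifier to predict the sentiment label $\hat{y}_j$. For example, the predicted label $\hat{y}_j$ corresponds to the sentiment class of the phrase spanned by node $j$. The classifier for the unidirectional TreeGRU architectures uses the hidden state $h_j$, produced by the recursive computations up to node $j$ from the set of input nodes $x_j$, to predict the label as follows:

$$\hat{p}_\theta(y \mid x_j) = \mathrm{softmax}(W_s h_j) \qquad (16)$$

where $W_s \in \mathbb{R}^{c \times d}$ and $c$ is the number of sentiment classes.

The classifier for the bidirectional TreeBiGRU architectures uses both hidden states $h_j^{\uparrow}$ and $h_j^{\downarrow}$, produced by the recursive computations up to node $j$ during the Upward and Downward phases from the set of input nodes $x_j$, to predict the label as follows:

$$\hat{p}_\theta(y \mid x_j) = \mathrm{softmax}(W_s^{\uparrow} h_j^{\uparrow} + W_s^{\downarrow} h_j^{\downarrow}) \qquad (17)$$

where $W_s^{\uparrow}, W_s^{\downarrow} \in \mathbb{R}^{c \times d}$ and $c$ is the number of sentiment classes. The predicted label $\hat{y}_j$ is the argument with the maximum confidence:

$$\hat{y}_j = \arg\max_{y} \hat{p}_\theta(y \mid x_j) \qquad (18)$$

For the Structural Attention models, we use the final sentence representation $s$ to predict the sentiment label $\hat{y}_j$, where $j$ is the root node of the sentence. The cost function used is the negative log-likelihood of the ground-truth label $y^{k}$ at each node:

$$E(\theta) = -\sum_{k=1}^{m} \log \hat{p}_\theta(y^{k} \mid x^{k}) + \frac{\lambda}{2} \lVert \theta \rVert_2^{2} \qquad (19)$$

where $m$ is the number of labels in a training sample and $\lambda$ is the L2 regularization hyperparameter.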
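The per-node classifier and the regularized cost can be sketched as follows. This is an illustration only: the classifier is taken to be a linear map followed by a softmax without a bias term, and the L2 penalty is written with a factor of one half, both of which are assumptions. The names `predict` and `tree_loss` are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W_s, h):
    """Class probabilities and argmax label for one node state h."""
    probs = softmax([sum(w * x for w, x in zip(row, h)) for row in W_s])
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs, label

def tree_loss(node_probs, gold_labels, params, lam):
    """Summed negative log-likelihood over labeled nodes plus an L2 penalty."""
    nll = -sum(math.log(p[y]) for p, y in zip(node_probs, gold_labels))
    l2 = 0.5 * lam * sum(w * w for w in params)
    return nll + l2

# Toy usage: c = 3 classes, d = 2 hidden units.
probs, label = predict([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [2.0, 0.0])
loss = tree_loss([[0.5, 0.5]], [0], params=[1.0, 2.0], lam=0.1)
```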

Network Variant            d      Parameters
TreeGRU
 - without attention       300    7323005
 - with attention          300    7413605
TreeBiGRU
 - without attention       300    8135405
 - with attention          300    8317810
Table 1: Memory dimensions d and total network parameters for every network variant evaluated
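The first two rows of Table 1 allow a quick arithmetic check of how many parameters the attention layer adds. One reading that matches the reported difference exactly is a d x d weight matrix, a d-dimensional bias, and a d-dimensional context vector. The bias term and this breakdown are assumptions; the source only reports the totals.

```python
d = 300
without_attention = 7323005  # first row of Table 1
with_attention = 7413605     # second row of Table 1

# Assumed attention parameters: weight matrix, bias vector, context vector.
attention_params = d * d + d + d
difference = with_attention - without_attention
```

Under this reading both quantities equal 90600.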

3.2 Results

The evaluation results are presented in Table 2 in terms of accuracy, for several state-of-the-art models proposed in the literature as well as for the TreeGRU and TreeBiGRU models proposed in this work. Among the approaches reported in the literature, the highest accuracy is yielded by DMN for both the binary scheme (88.6) and the fine-grained scheme (52.1). The best performance is achieved by TreeBiGRU with attention, for both the binary (89.5) and the fine-grained (52.4) evaluation, exceeding any previously reported results. In addition, the attention mechanism employed in the proposed TreeGRU and TreeBiGRU models improves the performance on both evaluation metrics.

System                     Binary   Fine-grained
RNN                        82.4     43.2
MV-RNN                     82.9     44.4
RNTN                       85.4     45.7
PVec                       87.8     48.7
TreeLSTM                   88.0     51.0
DRNN                       86.6     49.8
DCNN                       86.8     48.5
CNN-multichannel           88.1     47.4
DMN                        88.6     52.1
TreeGRU
 - without attention       88.6     50.5
 - with attention          89.0     51.0
TreeBiGRU
 - without attention       88.5     51.3
 - with attention          89.5     52.4
Table 2: Test accuracies achieved on the Stanford Sentiment Treebank dataset. RNN, MV-RNN and RNTN [Socher et al. 2013]. PVec [Mikolov et al. 2013]. TreeLSTM [Tai et al. 2015]. DRNN [Irsoy and Cardie 2013]. DCNN [Kalchbrenner et al. 2014]. CNN-multichannel [Kim 2014]. DMN [Kumar et al. 2015].

4 Hyperparameters and Training Details

The evaluated models are trained with the AdaGrad algorithm [Duchi et al. 2010], using a learning rate of 0.01 and minibatches of 25 sentences. L2 regularization with strength $\lambda$ is applied to the model parameters. We use dropout with probability 0.5 on both the input layer and the softmax layer.
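For reference, a single AdaGrad update with the stated learning rate of 0.01 can be sketched as below. The small constant `eps` added to the denominator is a common numerical safeguard and an assumption here, as are the function name and the flat-list parameter layout.

```python
import math

def adagrad_step(params, grads, cache, lr=0.01, eps=1e-8):
    """In-place AdaGrad update for flat lists of parameters (sketch).

    cache accumulates the squared gradients seen so far for each parameter,
    so frequently updated parameters receive progressively smaller steps.
    """
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params

# Toy usage: two consecutive steps with the same gradient.
params, cache = [1.0], [0.0]
adagrad_step(params, [2.0], cache)
after_first = params[0]
adagrad_step(params, [2.0], cache)
after_second = params[0]
```

The second step is smaller than the first because the accumulated squared gradient has grown.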

The word embeddings are initialized using the publicly available GloVe vectors of dimensionality 300. The GloVe vectors provide 95.5% coverage for the SST dataset. All initialized word vectors are fine-tuned during the training process along with every other parameter. Every matrix is initialized with the identity matrix multiplied by 0.5, except for the matrices of the softmax layer and the attention layer, which are randomly initialized from a normal (Gaussian) distribution. Every bias vector is initialized with zeros.
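The initialization scheme can be sketched with three small helpers. The zero mean and the default standard deviation of the Gaussian are assumptions, since the text does not specify the parameters of the distribution; the helper names are hypothetical.

```python
import random

def scaled_identity(d, scale=0.5):
    """d x d identity matrix multiplied by scale (used for the recursive weights)."""
    return [[scale if i == j else 0.0 for j in range(d)] for i in range(d)]

def gaussian_matrix(rows, cols, std=1.0, rng=random):
    """rows x cols matrix of Gaussian samples (softmax and attention layers).

    Assumption: zero mean and standard deviation std.
    """
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def zero_bias(d):
    """Bias vector initialized with zeros."""
    return [0.0] * d

# Toy usage with small sizes and a seeded generator for repeatability.
W = scaled_identity(3)
W_s = gaussian_matrix(5, 3, rng=random.Random(0))
b = zero_bias(3)
```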

The training process lasts for 40 epochs. During training, we evaluate the network 4 times every epoch and keep the parameters which give the best root accuracy on the development dataset.
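The model selection procedure can be sketched as a training loop that evaluates on the development set several times per epoch and keeps the best snapshot. Spacing the evaluations evenly over the minibatches is an assumption, and `train_step`, `evaluate`, and `snapshot` are hypothetical callables standing in for the actual training code.

```python
def train_with_selection(num_batches, train_step, evaluate, snapshot,
                         epochs=40, evals_per_epoch=4):
    """Train and keep the parameters with the best development accuracy (sketch)."""
    best_acc, best_params = -1.0, None
    # Assumption: evaluations are spaced evenly across the minibatches.
    interval = max(1, num_batches // evals_per_epoch)
    for epoch in range(epochs):
        for batch in range(num_batches):
            train_step(epoch, batch)
            if (batch + 1) % interval == 0:
                acc = evaluate()
                if acc > best_acc:
                    best_acc, best_params = acc, snapshot()
    return best_acc, best_params
```

When the number of minibatches is not divisible by `evals_per_epoch`, this simple spacing can trigger more than four evaluations per epoch.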

5 Conclusion

In this short paper, we propose an extension of Recursive Neural Networks that incorporates a bidirectional approach with gated memory units as well as an attention model at the structural level. The proposed models were evaluated on both fine-grained and binary sentiment classification tasks at the sentence level. Our results indicate that both the direction of the computation and attention at the structural level can enhance the performance of neural networks on a sentiment analysis task.