Aspect-based Sentiment Classification with Aspect-specific Graph Convolutional Networks

09/08/2019 ∙ by Chen Zhang, et al. ∙ Beijing Institute of Technology ∙ Università di Padova

Due to their inherent capability of semantically aligning aspects with their context words, attention mechanisms and Convolutional Neural Networks (CNNs) are widely applied to aspect-based sentiment classification. However, these models lack a mechanism to account for relevant syntactical constraints and long-range word dependencies, and hence may mistakenly recognize syntactically irrelevant context words as clues for judging aspect sentiment. To tackle this problem, we propose to build a Graph Convolutional Network (GCN) over the dependency tree of a sentence to exploit syntactical information and word dependencies. On this basis, a novel aspect-specific sentiment classification framework is proposed. Experiments on three benchmarking collections illustrate that our proposed model has comparable effectiveness to a range of state-of-the-art models, and further demonstrate that both syntactical information and long-range word dependencies are properly captured by the graph convolution structure.



Code and preprocessed dataset for EMNLP 2019 paper titled "Aspect-based Sentiment Classification with Aspect-specific Graph Convolutional Networks"

1 Introduction

Aspect-based (also known as aspect-level) sentiment classification aims at identifying the sentiment polarities of aspects explicitly given in sentences. For example, in a comment about a laptop saying “From the speed to the multi-touch gestures this operating system beats Windows easily.”, the sentiment polarities for two aspects operating system and Windows are positive and negative, respectively. Generally, this task is formulated as predicting the polarity of a provided (sentence, aspect) pair.

Given the inefficiency of manual feature refinement jiang2011target, early works of aspect-based sentiment classification are mainly based on neural network methods dong2014adaptive; vo2015target. Ever since tang2016effective pointed out the challenge of modelling semantic relatedness between context words and aspects, attention mechanisms coupled with Recurrent Neural Networks (RNNs) bahdanau2014neural; luong2015effective; xu2015show have started to play a critical role in more recent models wang2016attention; tang2016aspect; yang2017attention; liu2017attention; ma2017interactive; huang2018aspect.

While attention-based models are promising, they are insufficient to capture syntactical dependencies between context words and the aspect within a sentence. Consequently, the current attention mechanism may lead a given aspect to mistakenly attend to syntactically unrelated context words as descriptors (Limitation 1). Consider a concrete example: “Its size is ideal and the weight is acceptable.”. Attention-based models often identify acceptable as a descriptor of the aspect size, which is in fact not the case. To address this issue, he2018effective imposed syntactical constraints on attention weights, but the effect of syntactical structure was not fully exploited.

In addition to attention-based models, Convolutional Neural Networks (CNNs) xue2018aspect; li2018transformation have been employed to discover descriptive multi-word phrases for an aspect, based on the finding fan2018convolution that the sentiment of an aspect is usually determined by key phrases rather than individual words. Nevertheless, by convolving over word sequences, CNN-based models can only perceive multi-word features as consecutive words, and are thus inadequate for sentiments depicted by multiple words that are not adjacent (Limitation 2). In the sentence “The staff should be a bit more friendly” with staff as the aspect, a CNN-based model may make an incorrect prediction by detecting more friendly as the descriptive phrase, disregarding the impact of should be, which is two words away but reverses the sentiment.

In this paper, we aim to tackle the two limitations identified above with Graph Convolutional Networks (GCNs) kipf2017semi. A GCN has a multi-layer architecture in which each layer encodes and updates the representation of each node in the graph using the features of its immediate neighbours. By operating over syntactical dependency trees, a GCN is potentially capable of drawing syntactically relevant words towards the target aspect, and of exploiting long-range multi-word relations and syntactical information through its layers. GCNs have been deployed on document-word relationships yao2018graph and tree structures marcheggiani2017encoding; zhang2018graph, but how they can be effectively used in aspect-based sentiment classification is yet to be explored.

To fill the gap, this paper proposes an Aspect-specific Graph Convolutional Network (ASGCN), which, to the best of our knowledge, is the first GCN-based model for aspect-based sentiment classification. ASGCN starts with a bidirectional Long Short-Term Memory network (LSTM) layer to capture contextual information regarding word orders. In order to obtain aspect-specific features, a multi-layered graph convolution structure is implemented on top of the LSTM output, followed by a masking mechanism that filters out non-aspect words and keeps solely high-level aspect-specific features. The aspect-specific features are fed back to the LSTM output for retrieving informative features with respect to the aspect, which are then used to predict aspect-based sentiment.

Experiments on three benchmarking datasets show that ASGCN effectively addresses both limitations of the current aspect-based sentiment classification approaches, and outperforms a range of state-of-the-art models.

Our contributions are as follows:

  • We propose to exploit syntactical dependency structures within a sentence and resolve the long-range multi-word dependency issue for aspect-based sentiment classification.

  • We posit that Graph Convolutional Networks (GCNs) are suitable for our purpose, and propose a novel Aspect-specific GCN model. To the best of our knowledge, this is the first investigation in this direction.

  • Extensive experiment results verify the importance of leveraging syntactical information and long-range word dependencies, and demonstrate the effectiveness of our model in capturing and exploiting them in aspect-based sentiment classification.

2 Graph Convolutional Networks

GCNs can be considered as an adaptation of the conventional CNNs for encoding local information of unstructured data. For a given graph with k nodes, an adjacency matrix A \in \mathbb{R}^{k \times k} is obtained by enumerating the edges of the graph, where A_{ij} indicates whether the i-th node is adjacent to the j-th node. For convenience, we denote the output of the l-th layer for node i as h_i^l, where h_i^0 represents the initial state of node i. For an L-layer GCN, l \in [1, L] and h_i^L is the final state of node i. The graph convolution operated on the node representations can be written as:

h_i^l = \sigma\Big( \sum_{j=1}^{k} A_{ij} W^l h_j^{l-1} + b^l \Big)

where W^l is a linear transformation weight, b^l is a bias term, and \sigma is a nonlinear function, e.g. ReLU. For better illustration, an example of a GCN layer is shown in Figure 1.


Figure 1: An example of GCN layer.

As the graph convolution process only encodes information of immediate neighbours, a node in the graph can only be influenced by neighbouring nodes within L steps in an L-layer GCN. In this way, graph convolution over the dependency tree of a sentence provides syntactical constraints for an aspect within the sentence to identify descriptive words based on syntactical distances. Moreover, GCN is able to deal with circumstances where the polarity of an aspect is described by non-consecutive words, as the GCN over the dependency tree gathers those non-consecutive words into a smaller scope and aggregates their features properly via graph convolution. Therefore, we are inspired to adopt GCN to leverage syntactical information and long-range word dependencies for aspect-based sentiment classification.
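As a concrete illustration, the graph convolution above can be sketched in a few lines of NumPy. The graph, features, and weights below are toy placeholders chosen for illustration, not learned parameters:

```python
import numpy as np

def gcn_layer(H, A, W, b):
    """One GCN layer: every node aggregates the features of its
    immediate neighbours (rows of the adjacency matrix A) and applies
    a linear transformation followed by a ReLU nonlinearity."""
    return np.maximum(0.0, A @ H @ W + b)

# Toy graph with 3 nodes; node 0 is adjacent to nodes 1 and 2,
# and self-loops are included on the diagonal.
A = np.array([[1., 1., 1.],
              [1., 1., 0.],
              [1., 0., 1.]])
H0 = np.eye(3)            # initial node states h_i^0 (one-hot features)
W = np.ones((3, 2))       # linear transformation weight W^1
b = np.zeros(2)           # bias term b^1
H1 = gcn_layer(H0, A, W, b)   # node states h_i^1 after one layer
```

Stacking L such calls gives every node a receptive field covering its L-hop neighbourhood, which is exactly the property the paragraph above relies on.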

3 Aspect-specific Graph Convolutional Network

Figure 2 gives an overview of ASGCN. The components of ASGCN will be introduced separately in the rest of the section.

Figure 2: Overview of aspect-specific graph convolutional network.

3.1 Embedding and Bidirectional LSTM

Given an n-word sentence {w_1, w_2, \ldots, w_n} containing a corresponding m-word aspect starting from the (\tau+1)-th token, we embed each word token into a low-dimensional real-valued vector space bengio2003neural with embedding matrix E \in \mathbb{R}^{|V| \times d_e}, where |V| is the size of the vocabulary and d_e denotes the dimensionality of word embeddings. With the word embeddings of the sentence, a bidirectional LSTM is constructed to produce hidden state vectors H^c = {h_1^c, h_2^c, \ldots, h_n^c}, where h_t^c \in \mathbb{R}^{2 d_h} represents the hidden state vector at time step t from the bidirectional LSTM, and d_h is the dimensionality of a hidden state vector output by a unidirectional LSTM.
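The embedding lookup can be sketched as follows. The vocabulary, embedding matrix, and dimensions are hypothetical toys for illustration only; a real system would load pre-trained vectors (e.g. GloVe) and feed the resulting matrix to a bidirectional LSTM:

```python
import numpy as np

# Hypothetical toy vocabulary and dimensions, for illustration only.
vocab = {"the": 0, "staff": 1, "should": 2, "be": 3, "friendly": 4}
d_e = 8                                       # word embedding dimensionality
rng = np.random.default_rng(0)
E = rng.standard_normal((len(vocab), d_e))    # embedding matrix E (|V| x d_e)

sentence = ["the", "staff", "should", "be", "friendly"]
ids = [vocab[w] for w in sentence]
X = E[ids]        # n x d_e input matrix fed to the bidirectional LSTM
```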

3.2 Obtaining Aspect-oriented Features

Different from general sentiment classification, aspect-based sentiment classification targets at judging sentiments from the view of aspects, and thus calls for an aspect-oriented feature extraction strategy. In this study, we obtain aspect-oriented features by applying multi-layer graph convolution over the syntactical dependency tree of a sentence, and imposing an aspect-specific masking layer on its top.

3.2.1 Graph Convolution over Dependency Trees

Aiming to address the limitations of existing approaches (as discussed in previous sections), we leverage a graph convolutional network over the dependency trees of sentences. Specifically, after the dependency tree of the given sentence is constructed (we use the spaCy toolkit), we first obtain an adjacency matrix A according to the words in the sentence. It is important to note that dependency trees are directed graphs. While GCNs generally do not consider directions, they can be adapted to the direction-aware scenario. Accordingly, we propose two variants of ASGCN, i.e. ASGCN-DG on dependency graphs, which are undirected, and ASGCN-DT on dependency trees, which are directed. Practically, the only difference between ASGCN-DG and ASGCN-DT lies in their adjacency matrices: the adjacency matrix of ASGCN-DT is much sparser than that of ASGCN-DG. This setting is in accordance with the phenomenon that parent nodes are broadly influenced by their child nodes. Furthermore, following the idea of self-looping in kipf2017semi, each word is manually set adjacent to itself, i.e. the diagonal values of A are all ones.
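A minimal sketch of how the two adjacency variants might be built from a dependency parse. Here `heads` is a hypothetical parent-index list of the kind a spaCy parse would yield, and we assume the directed variant keeps only child-to-parent aggregation, per the remark above about parents being influenced by their children:

```python
import numpy as np

def adjacency_from_heads(heads, directed=False):
    """Build the adjacency matrix A over a sentence's dependency tree.
    heads[i] is the index of token i's parent (-1 for the root).
    directed=True sketches the ASGCN-DT variant (parents aggregate
    from children only, so A is sparser); directed=False the
    ASGCN-DG variant. Self-loops are added either way, so the
    diagonal of A is all ones."""
    n = len(heads)
    A = np.eye(n)
    for child, parent in enumerate(heads):
        if parent < 0:
            continue                    # skip the root token
        A[parent, child] = 1.0          # parent aggregates from child
        if not directed:
            A[child, parent] = 1.0      # make the graph undirected
    return A

# "great food": token 1 (food) is the head of token 0 (great).
A_dg = adjacency_from_heads([1, -1], directed=False)
A_dt = adjacency_from_heads([1, -1], directed=True)
```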

The ASGCN variants are performed in a multi-layer fashion on top of the bidirectional LSTM output from Section 3.1, i.e. h_i^0 = h_i^c, to make nodes aware of context zhang2018graph. The representation of each node is then updated with a graph convolution operation with a normalization factor kipf2017semi as below:

\tilde{h}_i^l = \sum_{j=1}^{n} A_{ij} W^l g_j^{l-1}

h_i^l = \mathrm{ReLU}\big( \tilde{h}_i^l / (d_i + 1) + b^l \big)

where g_j^{l-1} is the j-th token's representation evolved from the preceding GCN layer, h_i^l is the product of the current GCN layer, and d_i = \sum_{j=1}^{n} A_{ij} is the degree of the i-th token in the tree. The weights W^l and bias b^l are trainable parameters.

It is worth noting that h_i^{l-1} is not immediately fed into the successive GCN layer; we first conduct a position-aware transformation:

g_i^{l-1} = F(h_i^{l-1})

where F(\cdot) is a function assigning position weights, widely adopted by previous works li2018transformation; tang2016aspect; chen2017recurrent, for augmenting the importance of context words close to the aspect. By doing so we aim at reducing the noise and bias that may have naturally arisen from the dependency parsing process. Specifically, the function F(\cdot) is:

F(h_i^{l-1}) = q_i h_i^{l-1}

q_i = \begin{cases} 1 - \frac{\tau + 1 - i}{n} & 1 \le i < \tau + 1 \\ 0 & \tau + 1 \le i \le \tau + m \\ 1 - \frac{i - \tau - m}{n} & \tau + m < i \le n \end{cases}

where q_i is the position weight of the i-th token. The final output of the L-layer GCN is H^L = {h_1^L, h_2^L, \ldots, h_n^L}.
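The position-weight function can be sketched as below, using 0-based token indices; the linear decay with distance from the aspect is our reading of the scheme described above:

```python
def position_weights(n, tau, m):
    """Position weight q_i for each of n tokens, given an m-word aspect
    starting at 0-based index tau. Context words get a weight that
    decays linearly with distance from the aspect; the aspect words
    themselves get 0, since their states are preserved later by the
    aspect-specific masking layer instead."""
    q = []
    for i in range(n):
        if i < tau:                          # context before the aspect
            q.append(1 - (tau - i) / n)
        elif i < tau + m:                    # aspect words
            q.append(0.0)
        else:                                # context after the aspect
            q.append(1 - (i - tau - m + 1) / n)
    return q

q = position_weights(n=5, tau=2, m=1)   # aspect is the 3rd of 5 tokens
```

Note how the weights are symmetric around a single-word aspect and shrink toward the sentence boundaries.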

3.2.2 Aspect-specific Masking

In this layer, we mask out the hidden state vectors of non-aspect words and keep the aspect word states unchanged:

h_t^L = 0, \quad 1 \le t < \tau + 1 \ \text{or} \ \tau + m < t \le n

The outputs of this zero-masking layer are the aspect-oriented features H_{mask}^L = {0, \ldots, h_{\tau+1}^L, \ldots, h_{\tau+m}^L, \ldots, 0}. Through graph convolution, these features have perceived contexts around the aspect in a way that takes both syntactical dependencies and long-range multi-word relations into account.
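Aspect-specific masking amounts to zeroing the feature rows of non-aspect tokens, a sketch of which (with toy GCN outputs) is:

```python
import numpy as np

def aspect_mask(HL, tau, m):
    """Zero out GCN output states of non-aspect tokens; aspect word
    states (tokens tau .. tau+m-1, 0-based) pass through unchanged."""
    mask = np.zeros((HL.shape[0], 1))
    mask[tau:tau + m] = 1.0
    return HL * mask

HL = np.arange(12, dtype=float).reshape(4, 3)   # 4 tokens, feature dim 3
Hmask = aspect_mask(HL, tau=1, m=2)             # keep tokens 1 and 2 only
```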

3.3 Aspect-aware Attention

Based on the aspect-oriented features, a refined representation of the hidden state vectors is produced via a novel retrieval-based attention mechanism. The idea is to retrieve significant features that are semantically relevant to the aspect words from the hidden state vectors, and accordingly set a retrieval-based attention weight for each context word. In our implementation, the attention weights are computed as below:

\beta_t = \sum_{i=\tau+1}^{\tau+m} h_t^{c\top} h_i^L

\alpha_t = \frac{\exp(\beta_t)}{\sum_{j=1}^{n} \exp(\beta_j)}

Here, the dot product measures the semantic relatedness between aspect component words and words in the sentence, so that the aspect-specific masking, i.e. zero masking, takes effect: only the aspect word states contribute to \beta_t. The final representation for prediction is therefore formulated as:

r = \sum_{t=1}^{n} \alpha_t h_t^c
3.4 Sentiment Classification

Having obtained the representation r, it is fed into a fully-connected layer, followed by a softmax normalization layer, to yield a probability distribution p over the polarity decision space:

p = \mathrm{softmax}(W_p r + b_p)

where the dimensionality of p equals the number of sentiment labels, while W_p and b_p are the learned weight and bias, respectively.

3.5 Training

The model is trained by the standard gradient descent algorithm with the cross-entropy loss and L_2-regularization:

L = -\sum_{(s, a) \in \mathcal{D}} \sum_{c \in \mathcal{C}} \hat{p}_c \log p_c + \lambda \lVert \Theta \rVert_2

where \mathcal{D} denotes the collection of (sentence, aspect) pairs, \hat{p} is the one-hot vector of the gold label with \hat{p}_c its c-th element, p_c is the predicted probability of polarity class c, \Theta represents all trainable parameters, and \lambda is the coefficient of the L_2-regularization.

4 Experiments

4.1 Datasets and Experimental Settings

Our experiments are conducted on five datasets: one (Twitter) is originally built by dong2014adaptive and contains twitter posts, while the other four (Lap14, Rest14, Rest15, Rest16) are respectively from SemEval 2014 task 4 pontiki2014semeval, SemEval 2015 task 12 pontiki2015semeval and SemEval 2016 task 5 pontiki2016semeval, consisting of data from two categories, i.e. laptop and restaurant. Following previous work tang2016aspect, we remove samples with conflicting polarities (i.e. an opinion target associated with different sentiment polarities) or without explicit aspects in the sentences from Rest15 and Rest16. The statistics of the datasets are reported in Table 1.

Dataset  Split  # Pos.  # Neu.  # Neg.
Twitter  Train    1561    3127    1560
Twitter  Test      173     346     173
Lap14    Train     994     464     870
Lap14    Test      341     169     128
Rest14   Train    2164     637     807
Rest14   Test      728     196     196
Rest15   Train     912      36     256
Rest15   Test      326      34     182
Rest16   Train    1240      69     439
Rest16   Test      469      30     117
Table 1: Dataset statistics.

For all our experiments, 300-dimensional pre-trained GloVe vectors pennington2014glove are used to initialize word embeddings. All model weights are initialized from a uniform distribution. The dimensionality of hidden state vectors is set to 300. We use Adam as the optimizer with a learning rate of 0.001. The coefficient of L_2-regularization is 10^-5 and the batch size is 32. Moreover, the number of GCN layers is set to 2, which was the best-performing depth in pilot studies.

The experimental results are obtained by averaging 3 runs with random initialization, where Accuracy and Macro-Averaged F1 are adopted as the evaluation metrics. We also carry out paired t-test on both Accuracy and Macro-Averaged F1 to verify whether the improvements achieved by our models over the baselines are significant.

4.2 Models for Comparison

Model Twitter Lap14 Rest14 Rest15 Rest16
Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1
SVM 63.40 63.30 70.49 N/A 80.16 N/A N/A N/A N/A N/A
LSTM 69.56 67.70 69.28 63.09 78.13 67.47 77.37 55.17 86.80 63.88
MemNet 71.48 69.90 70.64 65.17 79.61 69.64 77.31 58.28 85.44 65.99
AOA 72.30 70.20 72.62 67.52 79.97 70.42 78.17 57.02 87.50 66.21
IAN 72.50 70.81 72.05 67.38 79.26 70.09 78.54 52.65 84.74 55.21
TNet-LF 72.98 71.43 74.61 70.14 80.42 71.03 78.47 59.47 89.07 70.43
ASCNN 71.05 69.45 72.62 66.72 81.73 73.10 78.47 58.90 87.39 64.56
ASGCN-DT 71.53 69.68 74.14 69.24 80.86 72.19 79.34 60.78 88.69 66.64
ASGCN-DG 72.15 70.40 75.55 71.05 80.77 72.02 79.89 61.89 88.99 67.48
Table 2: Model comparison results (%). Average accuracy and macro-F1 score over 3 runs with random initialization. The best two results on each dataset are in bold. Some baseline results are retrieved from the original papers or from dong2014adaptive; the significance of improvements is assessed with paired t-tests against ASCNN and TNet-LF.

In order to comprehensively evaluate the two variants of our model, namely, ASGCN-DG and ASGCN-DT, we compare them with a range of baselines and state-of-the-art models, as listed below:

  • SVM kiritchenko2014nrc is the winning model of SemEval 2014 task 4, built on conventional feature engineering.

  • LSTM tang2016effective uses the last hidden state vector of LSTM to predict sentiment polarity.

  • MemNet tang2016aspect considers contexts as external memories and benefits from a multi-hop architecture.

  • AOA huang2018aspect borrows the idea of attention-over-attention from the field of machine translation.

  • IAN ma2017interactive interactively models the relationships between aspects and their contexts.

  • TNet-LF li2018transformation puts forward Context-Preserving Transformation (CPT) to preserve and strengthen the informative part of contexts.

In order to examine to what degree GCN outperforms CNN, we also include a model named ASCNN in the experiment, which replaces the 2-layer GCN in ASGCN with a 2-layer CNN (to keep the input and output lengths consistent, the kernel size is set to 3 with padding of 1).
4.3 Results

As shown in Table 2, ASGCN-DG consistently outperforms all compared models on the Lap14 and Rest15 datasets, and achieves results comparable to the baseline TNet-LF on the Twitter and Rest16 datasets and to ASCNN on Rest14. The results demonstrate the effectiveness of ASGCN-DG, and the insufficiency of directly integrating syntax information into the attention mechanism as in he2018effective. Meanwhile, ASGCN-DG outperforms ASGCN-DT by a large margin on the Twitter, Lap14, Rest15 and Rest16 datasets, and ASGCN-DT falls below TNet-LF on Lap14. A possible reason is that information from parent nodes is as important as that from child nodes, so treating dependency trees as directed graphs leads to information loss. Additionally, ASGCN-DG outperforms ASCNN on all datasets except Rest14, illustrating that ASGCN is better at capturing long-range word dependencies, while ASCNN to some extent also benefits from aspect-specific masking. We suspect the Rest14 dataset is not very sensitive to syntactic information. Moreover, sentences from the Twitter dataset are less grammatical, which restricts the efficacy of dependency parsing; we conjecture this is likely why ASGCN-DG and ASGCN-DT obtain sub-optimal results on the Twitter dataset.

4.4 Ablation Study

To further examine the level of benefit that each component of ASGCN brings to the performance, an ablation study is performed on ASGCN-DG. The results are shown in Table 3. We also present the results of BiLSTM+Attn as a baseline, which uses two LSTMs for the aspect and the context respectively.

Model Twitter Lap14 Rest14 Rest15 Rest16
Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1
BiLSTM+Attn 71.24 69.55 72.83 67.82 79.85 70.03 78.97 58.18 87.28 68.18
ASGCN-DG 72.15 70.40 75.55 71.05 80.77 72.02 79.89 61.89 88.99 67.48
ASGCN-DG w/o pos. 72.69 70.59 73.93 69.63 81.22 72.94 79.58 61.55 88.04 66.63
ASGCN-DG w/o mask 72.64 70.63 72.05 66.56 79.02 68.29 77.80 57.51 86.36 61.41
ASGCN-DG w/o GCN 71.92 70.63 73.51 68.83 79.40 69.43 79.40 61.18 87.55 66.19
Table 3: Ablation study results (%). Accuracy and macro-F1 scores are the average value over 3 runs with random initialization.

First, removal of the position weights (i.e. ASGCN-DG w/o pos.) leads to performance drops on the Lap14, Rest15 and Rest16 datasets but to boosts on the Twitter and Rest14 datasets. Recalling the main results on the Rest14 dataset, we conclude that integrating position weights does not help reduce the noise of user-generated content when syntax is not crucial for the data. Moreover, once aspect-specific masking is removed (i.e. ASGCN-DG w/o mask), the model no longer remains competitive with TNet-LF. This verifies the significance of aspect-specific masking.

Compared with ASGCN-DG, ASGCN-DG w/o GCN (i.e. preserving position weights and aspect-specific masking, but without GCN layers) is much less powerful on all five datasets, except for the F1 metric on the Twitter dataset. However, ASGCN-DG w/o GCN is still slightly better than BiLSTM+Attn on all datasets except Rest14, owing to the strength of the aspect-specific masking mechanism.

It can thus be concluded that GCN contributes to ASGCN to a considerable extent, since GCN captures syntactic word dependencies and long-range word relations at the same time. Nevertheless, the GCN does not work as well as expected on datasets that are not sensitive to syntax information, as seen on the Twitter and Rest14 datasets.

4.5 Case Study

To better understand how ASGCN works, we present a case study with several testing examples. Particularly, we visualize the attention scores offered by MemNet, IAN, ASCNN and ASGCN-DG in Table 4, along with their predictions on these examples and the corresponding ground truth labels.

The first sample, “great food but the service was dreadful!”, has two aspects within one sentence, which may hinder attention-based models from precisely aligning the aspects with their relevant descriptive words. The second sample, “The staff should be a bit more friendly.”, uses the subjunctive “should”, adding extra difficulty to detecting the implicit semantics. The last example contains a negation, which can easily lead models to wrong predictions.

MemNet fails on all three presented samples. While IAN is capable of distinguishing modifiers for distinct aspects, it fails to infer the sentiment polarities of sentences with special styles. Armed with position weights, ASCNN correctly predicts the label of the second sample, as the phrase should be is close to the aspect staff, but fails on the third one, which involves a longer-range word dependency. Our ASGCN-DG correctly handles all three samples, implying that GCN effectively integrates syntactic dependency information into an enriched semantic representation. In particular, ASGCN-DG makes correct predictions on the second and the third samples, both having a seemingly biased focus, showing ASGCN's capability of capturing long-range multi-word features.

Model     Aspect     Sentence                                                     Prediction   Label
MemNet    food       great food but the service was dreadful !                    negative ✗   positive
MemNet    staff      The staff should be a bit more friendly .                    positive ✗   negative
MemNet    Windows 8  Did not enjoy the new Windows 8 and touchscreen functions .  positive ✗   negative
IAN       food       great food but the service was dreadful !                    positive ✓   positive
IAN       staff      The staff should be a bit more friendly .                    positive ✗   negative
IAN       Windows 8  Did not enjoy the new Windows 8 and touchscreen functions .  neutral ✗    negative
ASCNN     food       great food but the service was dreadful !                    positive ✓   positive
ASCNN     staff      The staff should be a bit more friendly .                    negative ✓   negative
ASCNN     Windows 8  Did not enjoy the new Windows 8 and touchscreen functions .  positive ✗   negative
ASGCN-DG  food       great food but the service was dreadful !                    positive ✓   positive
ASGCN-DG  staff      The staff should be a bit more friendly .                    negative ✓   negative
ASGCN-DG  Windows 8  Did not enjoy the new Windows 8 and touchscreen functions .  negative ✓   negative
Table 4: Case study. Predictions of MemNet, IAN, ASCNN and ASGCN-DG on testing examples, along with the corresponding gold labels. The marker ✓ indicates a correct prediction and ✗ an incorrect one.

5 Discussion

5.1 Investigation on the Impact of GCN Layers

As ASGCN involves an L-layer GCN, we investigate the effect of the number of layers L on the final performance of ASGCN-DG. Specifically, we vary L over the set {1, 2, 3, 4, 6, 8, 12} and inspect the corresponding Accuracy and Macro-Averaged F1 of ASGCN-DG on the Lap14 dataset. The results are illustrated in Figure 3.

Figure 3: Effect of the number of GCN layers. Accuracy and macro-F1 scores are the average value over 3 runs with random initialization.

On both metrics, ASGCN-DG achieves the best performance when L is 2, which justifies the choice of the number of layers in the experiment section. Moreover, a dropping trend on both metrics is present as L increases. For large L, especially when L equals 12, ASGCN-DG becomes more difficult to train due to the large number of parameters.

5.2 Investigation on the Effect of Multiple Aspects

In the datasets, there may exist multiple aspect terms in one sentence. We therefore intend to measure whether such a phenomenon affects the effectiveness of ASGCN. We divide the training samples in the Lap14 and Rest14 datasets into different groups based on the number of aspect terms in the sentences, and compute the training accuracy differences between these groups. It is worth noting that samples with more than 7 aspect terms are removed as outliers, because there are too few such samples for any meaningful comparison.

Figure 4: Accuracy versus the number of aspects (# Aspects) in the sentences.

It can be seen in Figure 4 that when the number of aspects in a sentence is more than 3, the accuracy fluctuates, indicating low robustness in capturing multiple-aspect correlations and suggesting the need to model multi-aspect dependencies in future work.

6 Related Work

Constructing neural network models over word sequences, such as CNNs kim2014convolutional; johnson2015semi, RNNs tang2016effective and Recurrent Convolutional Neural Networks (RCNNs) lai2015recurrent, has achieved promising performance in sentiment analysis. However, the importance of leveraging dependency trees for capturing long-range relations of words, and the lack of an effective mechanism for doing so, have also been recognized. tai2015improved showed that LSTMs with dependency trees or constituency trees outperformed CNNs. dong2014adaptive presented an adaptive recursive neural network using dependency trees, which achieved competitive results compared with strong baselines. More recent research showed that general dependency-based models struggle to achieve results comparable to attention-based models, as dependency trees alone do not capture long-term contextualized semantic information properly. Our work overcomes this limitation by adopting Graph Convolutional Networks (GCNs) kipf2017semi.

GCN has recently attracted growing attention in the artificial intelligence community and has been applied to Natural Language Processing (NLP). marcheggiani2017encoding claimed that GCN could be considered a complement to LSTM, and proposed a GCN-based model for semantic role labeling. vashishth2018dating and zhang2018graph used graph convolution over dependency trees in document dating and relation classification, respectively. yao2018graph introduced GCN to text classification utilizing document-word and word-word relations, and gained improvements over various state-of-the-art methods. Our work investigates the effect of dependency trees in depth via graph convolution, and develops an aspect-specific GCN model that integrates the LSTM architecture and attention mechanism for more effective aspect-based sentiment classification.

7 Conclusions and Future Work

We have re-examined the challenges encountered by existing models for aspect-based sentiment classification, and pointed out the suitability of graph convolutional networks (GCNs) for tackling these challenges. Accordingly, we have proposed a novel network that adopts GCN for aspect-based sentiment classification. Experimental results have indicated that GCN benefits the overall performance by leveraging both syntactical information and long-range word dependencies.

This study may be further improved in the following aspects. First, the edge information of the syntactical dependency trees, i.e. the label of each edge, is not exploited in this work; we plan to design a specific graph neural network that takes edge labels into consideration. Second, domain knowledge can be incorporated. Last but not least, the ASGCN model may be extended to simultaneously judge the sentiments of multiple aspects by capturing dependencies between aspect words.


Acknowledgments

This work is supported by The National Key Research and Development Program of China (grant No. 2018YFC0831704), the Natural Science Foundation of China (grant Nos. U1636203 and 61772363), the Major Project of Zhejiang Lab (grant No. 2019DH0ZX01), and the European Union's Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement No. 721321.