Selective Attention Based Graph Convolutional Networks for Aspect-Level Sentiment Classification

10/24/2019, by Xiaochen Hou et al.

Aspect-level sentiment classification aims to identify the sentiment polarity towards a specific aspect term in a sentence. Most current approaches mainly consider the semantic information by utilizing attention mechanisms to capture the interactions between the context and the aspect term. In this paper, we propose to employ graph convolutional networks (GCNs) on the dependency tree to learn syntax-aware representations of aspect terms. GCNs often show the best performance with two layers, and deeper GCNs do not bring additional gain due to the over-smoothing problem. However, in some cases, important context words cannot be reached within two hops on the dependency tree. Therefore we design a selective attention based GCN block (SA-GCN) to find the most important context words, and directly aggregate this information into the aspect-term representation. We conduct experiments on the SemEval 2014 Task 4 datasets. Our experimental results show that our model outperforms the current state-of-the-art.


Introduction

Aspect-level sentiment classification is a fine-grained sentiment analysis task, which aims to identify the sentiment polarity (e.g., positive, negative or neutral) of a specific aspect term appearing in a review. For example, in “Despite a slightly limited menu, everything prepared is done to perfection, ultra fresh and a work of food art.”, the sentiment polarities of the aspect terms “menu” and “food” are negative and positive, respectively. This task provides helpful information for potential customers to assist them in making purchase decisions.

To solve this problem, recent studies have shown that the interactions between the aspect term and its context are crucial to identify the sentiment polarity towards the given term. The attention mechanism is an effective way to learn such interactions. Many models first utilize RNN-based encoders and then stack several attention layers to capture these interactions for sentiment classification [Tang et al.2015, Tang, Qin, and Liu2016, Wang et al.2016, Ma et al.2017, Chen et al.2017, Liu and Zhang2017, Li et al.2018a, Wang et al.2018, Fan, Feng, and Zhao2018, Zheng and Xia2018, Wang and Lu2018, Li et al.2018b, Li, Liu, and Zhou2018]. Due to the success of BERT [Devlin et al.2018] as a pre-trained encoder on many different NLP tasks, some researchers have recently adapted BERT with co-attention and self-attention for sentiment classification [Song et al.2019, Xu et al.2019].

All the above approaches attempt to learn a context-aware representation of the aspect term by fully exploiting the semantic information from the context words without considering the syntactic structure. However, it has been shown that syntactic information obtained from dependency parsing is very effective in capturing long-range syntactic relations that are obscure from the surface form [Zhang, Qi, and Manning2018]. There are several ways to utilize the syntactic relations for sentiment classification. In [Dong et al.2014, Nguyen and Shirai2015], they extend RNN to obtain the representation of the aspect term by aggregating the syntactic information from the dependency or constituent trees of the sentence. In [He et al.2018], they propose to use the distance between the context word and the aspect term along the dependency tree as the attention weight.

Since dependency trees are special forms of graphs, we decide to apply GCNs on the dependency tree to aggregate syntactic information into the representations of the context and the aspect term. GCNs have been widely used to capture structural information and achieve convincing performance in various areas, such as social networks [Hamilton, Ying, and Leskovec2017, Kipf and Welling2016], recommendation systems [Ying et al.2018], relation extraction [Guo, Zhang, and Lu2019], knowledge graphs [Shang et al.2019] and multi-hop reading comprehension [Tu et al.2019]. Recently, [Zhao, Hou, and Wu2019] also employ GCNs for aspect-term sentiment analysis. They construct the graph by fully connecting all aspect terms or adjacent aspect terms rather than using the dependency tree.

Previous works show that GCN models with two layers achieve the best performance [Zhang, Qi, and Manning2018, Xu et al.2018]. Deeper GCNs do not bring additional gain due to the over-smoothing problem [Li, Han, and Wu2018], which makes different nodes have similar representations and lose the distinction among nodes. However, in some cases, the most important context words are more than three hops away from the aspect term on the dependency tree. For example, as shown in Figure 1, for the sentence “I thought learning the Mac OS would be hard, but it is easily picked up if you are familiar with a PC.” and the aspect term “Mac OS”, the sentiment-determining context words “easily picked up” are four hops away from the aspect term in the dependency tree.

Figure 1: Example of a Dependency Tree with Multiple Hops between the Aspect Term and the Sentiment-Determining Context Words

In order to solve the above problem, we propose a top-k attention selection mechanism working with GCNs. Specifically, we utilize the multi-head attention mechanism to compute the attention weights between any pair of words in the sentence. Then for each word, we choose the k context words with the highest attention weights as its neighbors and apply a GCN on top of the graph structure defined by these top-k context words. In this way, the interactions between the aspect term and context words are not constrained by the dependency tree alone. In the above example, the aspect term “Mac OS” gets direct access to the context words “easily picked up”, thus the sentiment polarity of the aspect term is more likely to be predicted correctly.

In this paper, we propose a selective attention based GCN model for the aspect-level sentiment classification task. First, we apply the pre-trained BERT as an encoder to obtain representations of the aspect term and its context words. These representations are used as initial node features for a GCN module operating on the dependency tree of the sentence. This GCN module is exploited to fuse the syntactic knowledge presented in the dependency tree and the semantic information from the BERT encoder. Next, the GCN outputs are fed into the top-k multi-head attention selection module to find important context words and generate the attention score matrix. Then, on top of the score matrix, we apply a GCN again to integrate information from the important context words.

Therefore, there are two separate GCN modules in our model: the first GCN module operates on the dependency tree; the second GCN module operates on the graph defined by the selected top-k attention scores of each head. The final aspect term representation integrates the semantic representation from BERT, syntactic information from the dependency tree, and the top-k attended context words from the sentence sequence. This representation is then fed into the final classification layer. We run experiments on the commonly used benchmark datasets of SemEval 2014 Task 4. The results show that our proposed model outperforms the current state-of-the-art.

The main contributions of this work are summarized as follows:

  1. We formulate learning from a dependency tree into a GCN model and take advantage of GCNs to fuse semantic information from the BERT encoder and syntactic information between the aspect term node and its context nodes in the dependency tree;

  2. We propose a selective attention based GCN module to overcome the limitations brought by the dependency trees, allowing the aspect term to directly obtain information from the most important context words according to attention weights;

  3. We conduct experiments on two SemEval 2014 Task 4 datasets. Our model achieves new state-of-the-art results on the datasets.

Figure 2: Model Architecture

Related Work

Various attention mechanisms, such as co-attention, self-attention and hierarchical attention, are utilized by recent models for aspect-level sentiment analysis [Tang et al.2015, Tang, Qin, and Liu2016, Liu and Zhang2017, Li et al.2018a, Wang et al.2018, Fan, Feng, and Zhao2018, Chen et al.2017, Zheng and Xia2018, Wang and Lu2018, Li, Liu, and Zhou2018]. Specifically, these models first encode the context and the aspect term with an RNN, and then stack several attention layers to learn the aspect term representations from important context words.

After the success of the pre-trained BERT model, [Song et al.2019] proposed the Attentional Encoder Network (AEN), which utilizes the pre-trained BERT as an encoder and stacks multi-head attention layers to draw the semantic interactions between the aspect term and context words. In [Xu et al.2019], aspect-level sentiment classification is considered as a review reading comprehension (RRC) problem: BERT is post-trained on RRC datasets and then fine-tuned for aspect-level sentiment classification.

The above approaches mainly consider semantic information; there are also approaches that incorporate syntactic information to learn a more enriched representation of the aspect term. [Dong et al.2014] propose AdaRNN, which adaptively propagates the sentiments of words to the target along the dependency tree in a bottom-up manner. [Nguyen and Shirai2015] extend RNN to obtain the representation of the target aspect by aggregating the syntactic information from the dependency and constituent trees of the sentence. [He et al.2018] propose to use the distance between the context word and the aspect term along the dependency tree as the attention weight.

[Zhao, Hou, and Wu2019] attempt to apply GCNs to this aspect-level sentiment classification problem as well. They construct a graph by fully connecting all aspect terms in one sentence or connecting adjacent aspect terms, and then employ GCNs over the attention mechanism to capture the sentiment dependencies between different aspects in one sentence.

Our proposed GCN model differs from the previous work by utilizing GCNs in two different modules: a GCN on the dependency tree, and a selective attention based GCN. The first GCN module learns syntax-aware representations from the dependency tree. Then a novel top-k attention selection layer is employed to select the most important context words and aggregate information from them with the GCN, such that the limitations (see Figure 1) introduced by applying GCNs on dependency trees can be alleviated.

Proposed Model

In this section, we first present an overview of our model. Then, we introduce its main modules in detail.

Overview of the Model

The goal of our model is to predict the sentiment polarity of an aspect term in a given sentence. Figure 2 illustrates the overall architecture of the proposed model.

For each instance consisting of a sentence-term pair, all the words in the sentence except for the aspect term are defined as context words. According to Figure 2, we perform aspect sentiment classification by the following steps: (1) Encode both the aspect term and context words by BERT, and use these representations as the initial features of the nodes (i.e., either context words or the aspect term) in the dependency tree; (2) Perform GCNs over the dependency tree; (3) Employ a novel selective attention based GCN (see the right part of Figure 2) to learn the representation of the aspect term; (4) Make the prediction based on the aspect term's representation induced from the previous steps.

Context Words and the Aspect Term Encoder

BERT Encoder. We use the pre-trained BERT as the encoder to get embeddings of context words and the aspect term. First, we construct the input as “[CLS] + sentence + [SEP] + term + [SEP]” and feed it into BERT. Note that for simplicity, we consider the aspect term as one single word (if it contains more than one word, we just take the average of the word representations). Suppose a sentence consists of $n$ words (thus there are $n-1$ context words) and the BERT output of the term word has $m$ sub-tokens. Then, the outputs of the sentence words from the last layer of BERT are treated as the embedding of the context words $H_c \in \mathbb{R}^{(n-1) \times d}$. Similarly, the term representation $H_t \in \mathbb{R}^{m \times d}$ is obtained, where $d$ is the dimension of the BERT output.
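The paper does not include an implementation; a minimal sketch of this encoding step, assuming the HuggingFace transformers library (the authors do not state which toolkit they use), could look as follows. The tokenizer's sentence-pair mode produces exactly the “[CLS] sentence [SEP] term [SEP]” layout described above.

```python
# Minimal sketch of the BERT encoding step, assuming the HuggingFace
# `transformers` library; the sentence and term below are illustrative.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "I use it mostly for content creation and its reliable"
term = "content creation"

# Sentence-pair encoding yields: [CLS] sentence tokens [SEP] term tokens [SEP]
inputs = tokenizer(sentence, term, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# Last-layer hidden states, shape (1, seq_len, 768): these provide the
# context-word embeddings H_c and the aspect-term sub-token embeddings H_t
# once the sentence and term token positions are separated.
hidden_states = outputs.last_hidden_state
```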

Self-attention. After obtaining the embedding of the aspect term, we apply self-attention to summarize the information carried by each sub-token of the aspect term and get a single feature representation as the term feature [Zhong et al.2019]. We utilize a two-layer Multi-Layer Perceptron (MLP) to compute the scores of the sub-tokens and take the weighted sum over all sub-tokens. This is formulated as follows:

(1)  $a = \mathrm{softmax}\left(W_2\,\sigma\left(W_1 H_t^{\top}\right)\right)$
(2)  $h_t = a H_t$

where $W_1$ and $W_2$ are learnable weight matrices, $H_t^{\top}$ is the transposition of $H_t$, and $\sigma$ denotes the activation function. The bias vectors are not shown here for simplicity.
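A minimal PyTorch sketch of this sub-token pooling; the tanh activation and the hidden width are assumptions, since the extracted text omits the exact symbols.

```python
import torch
import torch.nn as nn

class TermPooling(nn.Module):
    """Two-layer MLP scorer + weighted sum over the aspect-term sub-tokens (Eq. (1)-(2))."""
    def __init__(self, d: int, d_hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(d, d_hidden),
            nn.Tanh(),               # activation choice is an assumption
            nn.Linear(d_hidden, 1),
        )

    def forward(self, H_t: torch.Tensor) -> torch.Tensor:
        # H_t: (m, d) sub-token embeddings of the aspect term
        a = torch.softmax(self.score(H_t).squeeze(-1), dim=-1)  # (m,) sub-token weights
        return a @ H_t                                          # (d,) single term feature h_t
```

With the BERT outputs from the previous sketch, `TermPooling(768)(H_t)` collapses the m sub-token vectors of the term into one feature vector.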

Figure 3: Example of Dependency Tree

GCN on Dependency Trees

With the aspect term representation and context words representations as node features and dependency tree as the graph, we employ a GCN to capture syntactic relations between the term node and its neighboring nodes.

An example of the dependency tree is presented in Figure 3. The sentence is “I use it mostly for content creation (Audio, video, photo editing) and its reliable”. The aspect term is “content creation” (to keep the aspect term as one single word, we mask it with a placeholder before it’s parsed). In the dependency tree, “content creation” is connected with the sentiment context “reliable” directly.
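For illustration, the adjacency matrix used by the GCN below can be built directly from such a parse. The sketch uses the stanza library as a stand-in for Stanford CoreNLP (an assumption, not the authors' setup) and treats dependency edges as undirected.

```python
# Sketch: build a self-loop augmented adjacency matrix from a dependency parse.
# `stanza` is used here as a convenient stand-in for Stanford CoreNLP.
# Requires a one-time: stanza.download("en")
import numpy as np
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
# The aspect term is masked with a single placeholder token before parsing,
# mirroring the masking trick described above.
doc = nlp("I use it mostly for content_creation and its reliable")

words = doc.sentences[0].words
n = len(words)
A = np.eye(n)                                  # self-loop on each node
for w in words:
    head = int(w.head)
    if head > 0:                               # head == 0 marks the root
        A[int(w.id) - 1, head - 1] = 1.0
        A[head - 1, int(w.id) - 1] = 1.0       # treat dependency edges as undirected
```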

GCNs are designed to deal with data containing graph structure. A graph is constructed from nodes and edges. In each GCN layer, a node iteratively aggregates the information from its one-hop neighbors and updates its representation. If we use two GCN layers, the above process is repeated twice, so that each node gets information from neighbors up to two hops away. In our case, the graph is constructed from the dependency tree, where each word is treated as a single node and its representation is denoted as the node feature. The information propagation process can be formulated as follows:

(3)  $H^{(l)} = \sigma\left(\tilde{A} H^{(l-1)} W^{(l)}\right)$

where $H^{(l)}$ is the output of the $l$-th GCN layer, and $H^{(0)}$ is the input of the first GCN layer, composed of the context words $H_c$ and the aspect term $h_t$. $\tilde{A}$ denotes the adjacency matrix obtained from the dependency tree; note that we add a self-loop on each node. $W^{(l)}$ represents the learnable weights and $\sigma$ refers to the activation function.

After the node features are passed through the GCN layer, the representation of each node is further enriched by syntactic information.
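A minimal PyTorch sketch of Eq. (3); the row normalization of the adjacency matrix is an assumption, as the extracted text does not specify how $\tilde{A}$ is normalized.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H_out = sigma(A_hat @ H @ W), as in Eq. (3)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # A: (n, n) adjacency with self-loops; row-normalize so each node
        # averages over its neighbors (normalization scheme is an assumption).
        A_hat = A / A.sum(dim=-1, keepdim=True)
        return torch.relu(self.W(A_hat @ H))
```

With the adjacency matrix built from the parse and the BERT-derived node features, `GCNLayer(768, 128)(H, torch.from_numpy(A).float())` yields the syntax-enriched node representations.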

SA-GCN: Selective Attention based GCN

Although the dependency tree brings syntax information to the representation of each word, it also limits the interactions between the aspect term and other non-syntactically related context words that could be semantically very useful, as indicated in Figure 1. Therefore we apply a selective attention based GCN (SA-GCN) block to identify the most important context words and integrate their information into the representation of the aspect term without the dependency constraints. Our SA-GCN block can be repeated several times, and the number of blocks is a hyper-parameter that can be tuned on a dataset. Each SA-GCN block is composed of three parts: a multi-head self-attention layer, top-k selection and a GCN layer. We will introduce them in detail in the following sections.

Multi-Head Self-Attention. We apply multi-head self-attention to the representations of all words to get the attention score matrix $A \in \mathbb{R}^{h \times n \times n}$, where $h$ is the number of heads. It can be formulated as:

(4)  $A^{(i)} = \mathrm{softmax}\left(\frac{Q W_i^{Q} \left(K W_i^{K}\right)^{\top}}{\sqrt{d_{head}}}\right)$
(5)  $A = \left[A^{(1)}; A^{(2)}; \dots; A^{(h)}\right]$

where $Q$ and $K$ are the node representations from the last GCN layer, $W_i^{Q}, W_i^{K} \in \mathbb{R}^{d \times d_{head}}$ are learnable weight matrices, $d$ is the dimension of the input node feature, and $d_{head}$ is the dimension of each head.
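A sketch of the per-head score computation in Eq. (4)-(5); the scaled dot-product form and the shared input projections are standard multi-head attention choices assumed here.

```python
import math
import torch
import torch.nn as nn

class MultiHeadScores(nn.Module):
    """Produces the (h, n, n) attention score tensor A of Eq. (4)-(5)."""
    def __init__(self, d: int, n_heads: int = 4, d_head: int = 32):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.W_q = nn.Linear(d, n_heads * d_head, bias=False)
        self.W_k = nn.Linear(d, n_heads * d_head, bias=False)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (n, d) node representations from the previous GCN layer
        n = H.size(0)
        Q = self.W_q(H).view(n, self.n_heads, self.d_head).transpose(0, 1)  # (h, n, d_head)
        K = self.W_k(H).view(n, self.n_heads, self.d_head).transpose(0, 1)  # (h, n, d_head)
        scores = Q @ K.transpose(-1, -2) / math.sqrt(self.d_head)           # (h, n, n)
        return torch.softmax(scores, dim=-1)
```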

The obtained attention score matrix can be considered as an adjacency matrix of a fully-connected graph, where each word is connected to all the other context words with different attention weights. This full attention score matrix was used in attention-guided GCNs for relation extraction [Guo, Zhang, and Lu2019]. However, we propose the following top-k attention selection for our aspect-term classification task.

Top-k Selection. With the attention score matrix, we can find the top-k important context words for each word. k is a hyper-parameter and can be tuned on different datasets. The reason why we only choose the top-k instead of keeping all the context words is that some unimportant context words could introduce noise and confuse the classification of the sentiment polarity. For example, consider the sentence “To be completely fair, the only redeeming factor was the food, which was above average, but couldn’t make up for all the other deficiencies of Teodora.”, where the aspect term is “food” and the sentiment label is positive. Without the top-k selection, “food” gets direct access to the context word “deficiencies”, which might result in classifying the polarity of “food” as negative. But if we only keep the crucial context words, such as “redeeming” and “above average”, this potential risk is eliminated. Thus we directly choose the k context words with the highest attention weights, and get rid of the probable noise brought by other context words.

We design two strategies for top-k selection: head-independent and head-dependent. Head-independent selection is defined as follows: first sum the attention score matrices of all heads element-wise, and then find the top-k context words for each word using the mask generated by the function $\mathrm{top}k(\cdot)$, which keeps the k largest entries in each row and masks out the rest (for example, with k set to 2, only the two highest-scoring context words of each row are kept). Finally, we apply a softmax operation on the updated attention score matrix. The process can be formulated as follows:

(6)  $\bar{A} = \sum_{i=1}^{h} A^{(i)}$
(7)  $M = \mathrm{top}k\left(\bar{A}\right)$
(8)  $\hat{A}_{ind}^{(i)} = \mathrm{softmax}\left(M \odot A^{(i)}\right)$

where $A^{(i)}$ is the attention score matrix of the $i$-th head and $\odot$ denotes the element-wise multiplication.

Head-independent selection enables information exchange among all attention heads. Specifically, it makes the context words that are considered important by most heads get emphasized, while context words that only show value in a single head get ignored.
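A sketch of head-independent selection (Eq. (6)-(8)); using torch.topk for the mask and filling non-selected positions with -inf before the softmax (so they receive zero weight) are implementation assumptions.

```python
import torch

def head_independent_topk(A: torch.Tensor, k: int = 3) -> torch.Tensor:
    """A: (h, n, n) per-head attention scores -> (h, n, n) sparsified scores."""
    A_sum = A.sum(dim=0)                                 # Eq. (6): element-wise sum over heads
    idx = A_sum.topk(k, dim=-1).indices                  # top-k context words per row
    M = torch.zeros_like(A_sum).scatter_(-1, idx, 1.0)   # Eq. (7): shared 0/1 selection mask
    masked = (M * A).masked_fill(M == 0, float("-inf"))  # drop non-selected positions
    return torch.softmax(masked, dim=-1)                 # Eq. (8): renormalize kept scores
```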

Head-dependent selection finds the top-k context words according to the attention score matrix of each head individually. We apply the softmax operation on each top-k attention matrix. This step can be formulated as:

(9)  $\hat{A}_{dep}^{(i)} = \mathrm{softmax}\left(\mathrm{top}k\left(A^{(i)}\right) \odot A^{(i)}\right)$

Compared with head-independent selection, where exactly k context words are selected for each word, head-dependent selection usually selects a larger number (than k) of important context words and aggregates these words into the representation of the aspect term.
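The head-dependent variant (Eq. (9)) differs only in computing the mask per head; a corresponding sketch under the same assumptions:

```python
import torch

def head_dependent_topk(A: torch.Tensor, k: int = 3) -> torch.Tensor:
    """A: (h, n, n) per-head attention scores -> (h, n, n) sparsified scores."""
    idx = A.topk(k, dim=-1).indices                      # top-k per head and per row
    M = torch.zeros_like(A).scatter_(-1, idx, 1.0)       # per-head 0/1 masks
    masked = (M * A).masked_fill(M == 0, float("-inf"))
    return torch.softmax(masked, dim=-1)                 # Eq. (9)
```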

For simplicity, we will omit the ind and dep subscripts in the later sections. The obtained top-k score matrix $\hat{A}$ can be treated as an adjacency matrix, where $\hat{A}_{uv}$ denotes the weight of the edge connecting word $u$ and word $v$. Note that $\hat{A}$ does not contain self-loops, and we add a self-loop for each node.

Multi-head GCN. We apply a one-layer multi-head GCN and get updated node features as follows:

(10)  $H_i^{(l)} = \sigma\left(\hat{A}^{(i)} H^{(l-1)} W_i^{(l)}\right)$
(11)  $H^{(l)} = \left[H_1^{(l)}; H_2^{(l)}; \dots; H_h^{(l)}\right]$

where $H^{(l)}$ is the output of the $l$-th SA-GCN block, composed of the concatenation of the per-head outputs $H_i^{(l)}$; $H^{(0)}$ is the input of the first SA-GCN block and comes from the GCN layer operating on the dependency tree; $\hat{A}^{(i)}$ is the top-k score matrix of the $i$-th head; $W_i^{(l)}$ denotes the learnable weight matrix; and $\sigma$ refers to the activation function.
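A sketch of Eq. (10)-(11); the per-head score matrices $\hat{A}^{(i)}$ (with self-loops added, as noted above) can come from either selection strategy sketched earlier, and the dimensions echo the parameter settings reported later (4 heads of size 32).

```python
import torch
import torch.nn as nn

class MultiHeadGCN(nn.Module):
    """Per-head GCN over the sparsified attention adjacencies, then concat (Eq. (10)-(11))."""
    def __init__(self, d: int = 128, n_heads: int = 4, d_head: int = 32):
        super().__init__()
        self.W = nn.ModuleList([nn.Linear(d, d_head, bias=False) for _ in range(n_heads)])

    def forward(self, H: torch.Tensor, A_hat: torch.Tensor) -> torch.Tensor:
        # H: (n, d) node features; A_hat: (h, n, n) top-k score matrices, one per head
        heads = [torch.relu(W_i(A_hat[i] @ H)) for i, W_i in enumerate(self.W)]
        return torch.cat(heads, dim=-1)                  # (n, h * d_head)
```

With 4 heads of size 32 the output dimension matches the 128-dimensional input, so SA-GCN blocks can be stacked.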

Classifier

We extract the aspect-term node feature $h_t$ from the output of the last SA-GCN block. Then we feed it into a two-layer MLP to calculate the final classification scores.

(12)  $\hat{y} = \mathrm{softmax}\left(W_2\,\sigma\left(W_1 h_t\right)\right)$

where $W_1$ and $W_2$ denote the learnable weight matrices, the output dimension of $W_2$ equals the class number $C$, which is 3 in our case, and $\sigma$ refers to the activation function.

We use cross entropy as the loss function:

(13)  $\mathcal{L} = -\sum_{j} y_j \log \hat{y}_j + \lambda \lVert \Theta \rVert_2^2$

where $\lambda$ is the coefficient for L2-regularization, $\Theta$ denotes the parameters that need to be regularized, $y$ is the true label, and $\hat{y}$ is the predicted result.
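A sketch of the classifier and loss; the hidden width and the optimizer values are illustrative placeholders, and expressing the L2 term through the optimizer's weight decay is one common realization of Eq. (13), not necessarily the authors' choice.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Two-layer MLP over the aspect-term feature (Eq. (12))."""
    def __init__(self, d: int = 128, d_hidden: int = 64, n_classes: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_classes))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        return self.mlp(h_t)                 # logits; softmax is folded into the loss below

classifier = Classifier()
criterion = nn.CrossEntropyLoss()            # cross entropy of Eq. (13)
# weight_decay supplies the L2 term; values here are placeholders, not the paper's.
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3, weight_decay=1e-5)

h_t = torch.randn(8, 128)                    # a batch of aspect-term features (dummy data)
labels = torch.randint(0, 3, (8,))           # dummy sentiment labels
optimizer.zero_grad()
loss = criterion(classifier(h_t), labels)
loss.backward()
optimizer.step()
```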

Dataset      Positive (Train / Test)   Neutral (Train / Test)   Negative (Train / Test)
Laptop       987 / 341                 460 / 169                866 / 128
Restaurant   2164 / 728                633 / 196                805 / 196

Table 1: Statistics of Datasets
Model                                              Laptop            Restaurant
                                                   Acc   Macro-F1    Acc   Macro-F1
TD-LSTM [Tang et al.2015]                          68.2  62.3        75.6  64.2
ATAE-LSTM [Wang et al.2016]                        68.6  62.5        77.2  65.0
MemNet [Tang, Qin, and Liu2016]                    70.3  64.1        78.2  65.8
IAN [Ma et al.2017]                                72.1  -           78.6  -
RAM [Chen et al.2017]                              72.1  68.4        78.5  68.5
BERT-SPC [Song et al.2019]                         79.0  75.0        84.5  77.0
AEN-BERT [Song et al.2019]                         79.9  76.3        83.1  73.8
SDGCN-BERT [Zhao, Hou, and Wu2019]                 81.4  78.3        83.6  76.5
2-layer GCN on DT                                  78.8  74.2        84.6  76.6
1-layer GCN on DT + 2 SA-GCN blocks (head-ind)     81.7  78.8        85.8  79.7
  • DT: Dependency Tree; head-ind: head-independent selection.

Table 2: Comparison of our model with various baselines.
Model                                              Laptop            Restaurant
                                                   Acc   Macro-F1    Acc   Macro-F1
1-layer GCN on DT + 1 SA-GCN block (head-dep)      80.1  77.0        84.5  76.2
1-layer GCN on DT + 1 SA-GCN block (head-ind)      80.3  76.7        85.2  78.1
2 SA-GCN blocks (head-dep)                         78.8  74.7        81.8  71.2
2 SA-GCN blocks (head-ind)                         79.8  76.7        82.0  71.2
1-layer GCN on DT + 2 SA-GCN blocks (head-dep)     81.3  78.8        85.7  79.2
1-layer GCN on DT + 2 SA-GCN blocks (head-ind)     81.7  78.8        85.8  79.7
  • head-dep: head-dependent selection.

Table 3: Ablation study of our model

Experiments

Data Sets. We evaluate our model on two benchmark datasets: Restaurant and Laptop reviews from SemEval 2014 Task 4. We remove several examples with “conflict” labels in the reviews. The statistics of these two datasets are listed in Table 1.

Baselines. We compare our model with nine baseline models:

  1. TD-LSTM [Tang et al.2015] develops two target dependent LSTM models to model the left and right context of the aspect, and then concatenates them for prediction.

  2. ATAE-LSTM [Wang et al.2016] incorporates the attention mechanism to concentrate on different parts of a sentence when different aspects are taken as input.

  3. MemNet [Tang, Qin, and Liu2016]

    explicitly captures the importance of each context word when inferring the sentiment polarity of an aspect. The importance degree and text representation are calculated with multiple computational layers, each of which is a neural attention model over an external memory.

  4. IAN [Ma et al.2017] proposes the interactive attention networks (IAN) to interactively learn attentions in the contexts and targets, and generate the representations for targets and contexts separately.

  5. RAM [Chen et al.2017]

    adopts multiple-attention mechanism to capture sentiment features separated by a long distance, and the results of multiple attentions are non-linearly combined with a recurrent neural network.

  6. BERT-SPC [Song et al.2019] feeds “[CLS] + sentence + [SEP] + term + [SEP]” into the BERT model and the BERT outputs are used for prediction.

  7. AEN-BERT [Song et al.2019] uses BERT as the encoder and employs several attention layers.

  8. SDGCN-BERT [Zhao, Hou, and Wu2019] employs GCNs on graphs of fully connected aspects and adjacently connected aspects to capture the sentiment dependencies between different aspects in one sentence.

  9. 2-layer GCN on DT refers to our implementation of a 2-layer GCN on dependency trees based on BERT encoder.

Parameter Setting. During training, we use separate learning rates for BERT and for the GCN modules. We set the batch size to 16. We use a one-layer GCN for the dependency tree module, and a one-layer GCN for the selective attention module. We obtain the dependency tree using Stanford CoreNLP [Manning et al.2014]. We use two blocks of the selective attention module. The dimension of the BERT output is 768. The dimension of the GCN layer on dependency trees is 128. We use 4 heads in the multi-head attention, and for each head, the hidden dimension is set to 32. In top-k selection, we set k to 3. We apply dropout [Srivastava et al.2014] with a rate of 0.1, together with L2 regularization.
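The extracted text omits the exact learning-rate and L2 values, so the sketch below only illustrates the structure of this setup (separate learning rates for BERT and the GCN modules, dropout of 0.1, L2 via weight decay); every numeric value marked as a placeholder is an assumption, not the paper's setting.

```python
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
# Stand-ins for the GCN / SA-GCN modules defined in the model sketches above.
gcn_modules = nn.ModuleList([nn.Linear(768, 128), nn.Linear(128, 128)])

optimizer = torch.optim.Adam(
    [
        {"params": bert.parameters(), "lr": 2e-5},         # placeholder BERT learning rate
        {"params": gcn_modules.parameters(), "lr": 1e-3},  # placeholder GCN learning rate
    ],
    weight_decay=1e-5,                                     # placeholder L2 coefficient
)
dropout = nn.Dropout(p=0.1)                                # dropout rate 0.1 as reported
```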

Evaluation Settings. The evaluation metrics are accuracy and Macro-F1.

Figure 4: Impact of k (a) and block number (b) on our model over the Restaurant dataset

Experimental Results

We investigate our model in the following three aspects: 1) classification performance comparison; 2) ablation study; and 3) hyper-parameter analysis.

Classification. Table 2 shows comparisons of our model with the other baselines in terms of classification accuracy and Macro-F1. From this table, we observe that our model achieves the best results in both the Laptop and Restaurant domains, outperforming the current state-of-the-art by 0.3% and 1.3% in terms of accuracy, and 0.5% and 2.7% in terms of Macro-F1, respectively.

Specifically, our model is significantly superior to the 2-layer GCN model on the dependency tree on both datasets. This indicates that the proposed SA-GCN block effectively finds the most important context words that are crucial in determining the sentiment polarity.

Ablation Study. To demonstrate effectiveness of different modules in our model, we conduct ablation studies by the following settings:

  • 1-layer GCN on DT + 1 SA-GCN block applies a 1-layer GCN on dependency trees and then employs one SA-GCN block. head-dep and head-ind indicate applying head-dependent and head-independent top-k selection, respectively.

  • 2 SA-GCN blocks directly applies two SA-GCN blocks, and removes the GCN over dependency trees.

  • 1-layer GCN on DT + 2 SA-GCN blocks firstly applies the 1-layer GCN on dependency trees and then employs 2 SA-GCN blocks.

Table 3 shows the ablation study we conduct on our model. From this table, we can observe that:

  1. Effect of SA-GCN. Compared with the 2-layer GCN on dependency trees, in the Laptop and Restaurant domains, our 1-layer GCN on dependency trees + 1 SA-GCN block model improves by 1.3%, 0.6% on accuracy and 2.5%, 1.5% on Macro-F1 respectively, which shows that the SA-GCN block is beneficial to this task.

    This is because the proposed SA-GCN block breaks the limitation introduced by a 2-layer GCN on the dependency tree and allows the aspect terms to directly absorb the information from the most important context words that are not reachable within two hops in the dependency tree.

  2. Effect of Syntactic Information. Compared with 1-layer GCN on dependency trees + 1 SA-GCN block, the 2 SA-GCN blocks model without GCN on dependency trees decreases the accuracy and Macro-F1 on both datasets. This implies that the syntactic information introduced by the dependency trees is important and needed.

    We give an example in Figure 3 as a case study: the example is labeled as “positive” towards the aspect term “content creation”; however, it is classified as “neutral” by the 2 SA-GCN blocks model without dependency trees. In contrast, with the help of the syntactic information introduced by the dependency tree, “content creation” is directly linked with the sentiment context word “reliable”, and the 1-layer GCN on dependency trees + 1 SA-GCN block model makes a correct prediction.

  3. Effect of Head-independent and Head-dependent Selection. As shown in the last two rows in Table 3, head-independent selection achieves better results than head-dependent selection. This is because the mechanism of head-independent selection is similar to voting. By summing up the weight scores from each head, context words with higher scores in most heads get emphasized, and words that only show importance in a few heads are filtered out. Thus all heads reach an agreement and the top-k context words are decided. However, for head-dependent selection, each head selects different top-k context words, which is more likely to choose certain unimportant context words and introduce some noise.

Hyper-parameter Analysis. We explore the effect of the hyper-parameter k and the number of SA-GCN blocks on our proposed model under head-independent and head-dependent selection respectively. Figure 4 shows the results on “Restaurant”. Similar results can be found on “Laptop”.

  1. Effect of hyper-parameter k. From Figure 4(a), we can observe that: 1) the highest accuracy appears when k is equal to 3. As k becomes bigger, the accuracy goes down. The reason is that integrating information from too many context words could introduce distractions and confuse the representation of the current word.

    2) Head-independent selection performs better than head-dependent selection as k increases. This is because, with head-independent selection, the same top-k context words with different weights are used in different heads. Thus, each word only aggregates information from exactly those k context words. In contrast, with head-dependent selection, each head might select different context words. Thus more than k context words contribute to the aggregation, and it is more likely to introduce some noise.

  2. Effect of Block Number. We explore different numbers of SA-GCN blocks on the Restaurant dataset. The results are presented in Figure 4(b). As the block number increases, the accuracy decreases for both head-independent and head-dependent selection. Our SA-GCN block is designed to bring important context words that are multiple hops away from the aspect term on the dependency tree into the GCN, without using a deeper GCN. However, SA-GCN is also prone to over-smoothing with more layers. Our future work will look into ways to work with deeper GCNs [Guo et al.2019].

Conclusions

We propose a selective attention based GCN model for the aspect-level sentiment classification task. We first encode the aspect term and context words with pre-trained BERT to capture the interaction between context words and the aspect term, and then build a GCN on the dependency tree to incorporate syntactic information. In order to alleviate the limitation introduced by shallow GCNs on the dependency tree, we design a selective attention GCN block to select the top-k important context words and integrate their information into the GCN for aspect term representation learning. Our extensive experiments show that our model outperforms previous strong baselines and achieves new state-of-the-art results on the SemEval 2014 Task 4 datasets.

References