Transformation Networks for Target-Oriented Sentiment Classification (ACL'18)
Target-oriented sentiment classification aims at classifying sentiment polarities over individual opinion targets in a sentence. RNN with attention seems a good fit for the characteristics of this task, and indeed it achieves state-of-the-art performance. After re-examining the drawbacks of the attention mechanism and the obstacles that prevent CNN from performing well in this classification task, we propose a new model to overcome these issues. Instead of attention, our model employs a CNN layer to extract salient features from the transformed word representations originating from a bi-directional RNN layer. Between the two layers, we propose a component that generates target-specific representations of words in the sentence while incorporating a mechanism for preserving the original contextual information from the RNN layer. Experiments show that our model achieves new state-of-the-art performance on a few benchmarks.
Target-oriented (also referred to as “target-level” or “aspect-level” in some works) sentiment classification aims to determine sentiment polarities over “opinion targets” that explicitly appear in the sentences Liu (2012). For example, in the sentence “I am pleased with the fast log on, and the long battery life”, the user mentions two targets, “log on” and “battery life”, and expresses positive sentiments over them. The task is usually formulated as predicting a sentiment category for a (target, sentence) pair.
Recurrent Neural Networks (RNNs) with an attention mechanism, first proposed in machine translation Bahdanau et al. (2014), are the most commonly used technique for this task. For example, Wang et al. (2016); Tang et al. (2016b); Yang et al. (2017); Liu and Zhang (2017); Ma et al. (2017) and Chen et al. (2017) employ attention to measure the semantic relatedness between each context word and the target, and then use the induced attention scores to aggregate contextual features for prediction. In these works, the attention-weighted combination of word-level features for classification may introduce noise and degrade the prediction accuracy. For example, in “This dish is my favorite and I always get it and never get tired of it.”, these approaches tend to involve irrelevant words such as “never” and “tired” when they highlight the opinion modifier “favorite”. To some extent, this drawback is rooted in the attention mechanism, as also observed in machine translation Luong et al. (2015) and image captioning Xu et al. (2015).
Another observation is that the sentiment of a target is usually determined by key phrases such as “is my favorite”. By this token, Convolutional Neural Networks (CNNs), whose capability for extracting informative n-gram features (also called “active local features”) as sentence representations has been verified in Kim (2014); Johnson and Zhang (2015), should be a suitable model for this classification problem. However, CNN likely fails in cases where a sentence expresses different sentiments over multiple targets, such as “great food but the service was dreadful!”. One reason is that CNN cannot fully explore the target information as done by RNN-based methods Tang et al. (2016a). (One method could be concatenating the target representation with each word representation, but the effect as shown in Wang et al. (2016) is limited.) Moreover, it is hard for vanilla CNN to differentiate opinion words of multiple targets. Precisely, multiple active local features holding different sentiments (e.g., “great food” and “service was dreadful”) may be captured for a single target, which hinders the prediction.
We propose a new architecture, named Target-Specific Transformation Networks (TNet), to solve the above issues in the task of target sentiment classification. TNet firstly encodes the context information into word embeddings and generates the contextualized word representations with LSTMs. To integrate the target information into the word representations, TNet introduces a novel Target-Specific Transformation (TST) component for generating the target-specific word representations. Contrary to the previous attention-based approaches which apply the same target representation to determine the attention scores of individual context words, TST firstly generates different representations of the target conditioned on individual context words, then it consolidates each context word with its tailor-made target representation to obtain the transformed word representation. Considering the context word “long” and the target “battery life” in the above example, TST firstly measures the associations between “long” and individual target words. Then it uses the association scores to generate the target representation conditioned on “long”. After that, TST transforms the representation of “long” into its target-specific version with the new target representation. Note that “long” could also indicate a negative sentiment (say for “startup time”), and the above TST is able to differentiate them.
As the context information carried by the representations from the LSTM layer will be lost after the non-linear TST, we design a context-preserving mechanism to contextualize the generated target-specific word representations. Such a mechanism also allows a deep transformation structure to learn abstract features (abstract features usually refer to the features ultimately useful for the task; see Bengio et al. (2013); LeCun et al. (2015)). To help the CNN feature extractor locate sentiment indicators more accurately, we adopt a proximity strategy that scales the input of the convolutional layer with the positional relevance between a word and the target.
In summary, our contributions are as follows:
TNet adapts CNN to handle target-level sentiment classification, and its performance dominates the state-of-the-art models on benchmark datasets.
A novel Target-Specific Transformation component is proposed to better integrate target information into the word representations.
A context-preserving mechanism is designed to forward the context information into a deep transformation architecture, thus, the model can learn more abstract contextualized word features from deeper networks.
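Given a target-sentence pair $(\mathbf{w}^\tau, \mathbf{w})$, where $\mathbf{w}^\tau = \{w_1^\tau, w_2^\tau, \ldots, w_m^\tau\}$ is a sub-sequence of $\mathbf{w} = \{w_1, w_2, \ldots, w_n\}$, and the corresponding word embeddings $\mathbf{x}^\tau = \{x_1^\tau, \ldots, x_m^\tau\}$ and $\mathbf{x} = \{x_1, \ldots, x_n\}$, the aim of target sentiment classification is to predict the sentiment polarity $y \in \{P, N, O\}$ of the sentence $\mathbf{w}$ over the target $\mathbf{w}^\tau$, where $P$, $N$ and $O$ denote “positive”, “negative” and “neutral” sentiments respectively.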
The architecture of the proposed Target-Specific Transformation Networks (TNet) is shown in Fig. 1. The bottom layer is a BiLSTM which transforms the input $\mathbf{x} = \{x_1, \ldots, x_n\} \in \mathbb{R}^{n \times dim_w}$ into the contextualized word representations $\mathbf{h}^{(0)} = \{h_1^{(0)}, \ldots, h_n^{(0)}\} \in \mathbb{R}^{n \times 2dim_h}$ (i.e. hidden states of BiLSTM), where $dim_w$ and $dim_h$ denote the dimensions of the word embeddings and the hidden representations respectively. The middle part, the core part of our TNet, consists of $L$ Context-Preserving Transformation (CPT) layers. The CPT layer incorporates the target information into the word representations via a novel Target-Specific Transformation (TST) component. CPT also contains a context-preserving mechanism, resembling identity mapping He et al. (2016a, b) and highway connection Srivastava et al. (2015a, b), which allows preserving the context information and learning more abstract word-level features using a deep network. The top-most part is a position-aware convolutional layer which first encodes positional relevance between a word and a target, and then extracts informative features for classification.
As observed in Lai et al. (2015), combining contextual information with word embeddings is an effective way to represent a word in convolution-based architectures. TNet also employs a BiLSTM to accumulate the context information for each word of the input sentence, i.e., the bottom part in Fig. 1. For simplicity and to save space, we denote the operation of an LSTM unit on $x_i$ as $\mathrm{LSTM}(x_i)$. Thus, the contextualized word representation $h_i^{(0)} \in \mathbb{R}^{2dim_h}$ is obtained as follows:

$$h_i^{(0)} = [\overrightarrow{\mathrm{LSTM}}(x_i); \overleftarrow{\mathrm{LSTM}}(x_i)], \quad i \in [1, n]. \tag{1}$$
The above word-level representation has not considered the target information yet. Traditional attention-based approaches keep the word-level features static and aggregate them with weights as the final sentence representation. In contrast, as shown in the middle part in Fig. 1, we introduce multiple CPT layers and the detail of a single CPT is shown in Fig. 2. In each CPT layer, a tailor-made TST component that aims at better consolidating word representation and target representation is proposed. Moreover, we design a context-preserving mechanism enabling the learning of target-specific word representations in a deep neural architecture.
The TST component is depicted with the TST block in Fig. 2. The first task of TST is to generate the representation of the target. Previous methods Chen et al. (2017); Liu and Zhang (2017) average the embeddings of the target words as the target representation. This strategy may be inappropriate in some cases because different target words usually do not contribute equally. For example, in the target “amd turin processor”, the word “processor” is more important than “amd” and “turin”, because the sentiment is usually conveyed over the phrase head, i.e., “processor”, but seldom over modifiers (such as the brand name “amd”). Ma et al. (2017) attempted to overcome this issue by measuring the importance score between each target word representation and the averaged sentence vector. However, it may be ineffective for sentences expressing multiple sentiments (e.g., “Air has higher resolution but the fonts are small.”), because taking the average tends to neutralize different sentiments.
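We propose to dynamically compute the importance of target words based on each sentence word rather than the whole sentence. We first employ another BiLSTM to obtain the target word representations $\mathbf{h}^\tau \in \mathbb{R}^{m \times 2dim_h}$:

$$h_j^\tau = [\overrightarrow{\mathrm{LSTM}}(x_j^\tau); \overleftarrow{\mathrm{LSTM}}(x_j^\tau)], \quad j \in [1, m]. \tag{2}$$

Then, we dynamically associate them with each word $w_i$ in the sentence to tailor-make the target representation $r_i^\tau$ at the time step $i$:

$$r_i^\tau = \sum_{j=1}^{m} h_j^\tau \cdot \mathcal{F}(h_i^{(l)}, h_j^\tau), \tag{3}$$

where the function $\mathcal{F}$ measures the relatedness between the $j$-th target word representation $h_j^\tau$ and the $i$-th word-level representation $h_i^{(l)}$:

$$\mathcal{F}(h_i^{(l)}, h_j^\tau) = \frac{\exp\big(h_i^{(l)\top} h_j^\tau\big)}{\sum_{k=1}^{m} \exp\big(h_i^{(l)\top} h_k^\tau\big)}. \tag{4}$$

Finally, the concatenation of $r_i^\tau$ and $h_i^{(l)}$ is fed into a fully-connected layer to obtain the $i$-th target-specific word representation $\tilde{h}_i^{(l)}$:

$$\tilde{h}_i^{(l)} = g\big(W^\tau [h_i^{(l)} : r_i^\tau] + b^\tau\big), \tag{5}$$

where $g(\cdot)$ is a non-linear activation function and “:” denotes vector concatenation. $W^\tau$ and $b^\tau$ are the weights of the layer.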
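The transformation of a single context word can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the relatedness function is a softmax over dot products, uses tanh as the non-linear activation, and the toy vectors and weights (`h_long`, `target`, `W`, `b`) are hypothetical values chosen only to make the example run.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tailor_made_target(h_i, target_states):
    """Target representation conditioned on one context word h_i.

    Assumption: relatedness = softmax over dot products with each target word.
    """
    weights = softmax([dot(h_i, h_t) for h_t in target_states])
    dim = len(target_states[0])
    r_i = [sum(w * h_t[d] for w, h_t in zip(weights, target_states))
           for d in range(dim)]
    return r_i, weights

def tst(h_i, target_states, W, b):
    """Fully connected layer over the concatenation [h_i : r_i] (tanh assumed)."""
    r_i, _ = tailor_made_target(h_i, target_states)
    x = h_i + r_i  # list concatenation
    return [math.tanh(dot(row, x) + b_k) for row, b_k in zip(W, b)]

# Hypothetical toy values: one context word and a two-word target.
h_long = [1.0, 0.0]
target = [[2.0, 0.0], [0.0, 2.0]]
r, w = tailor_made_target(h_long, target)
W = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
b = [0.0, 0.0]
h_tilde = tst(h_long, target, W, b)
```

Because the association weights are recomputed for every context word, two different context words generally receive two different target representations, which is the point of contrast with averaging the target embeddings.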
After the non-linear TST (see Eq. 5), the context information captured with contextualized representations from the BiLSTM layer will be lost since the mean and the variance of the features within the feature vector will be changed. To take advantage of the context information, which has been proved to be useful in Lai et al. (2015), we investigate two strategies, Lossless Forwarding (LF) and Adaptive Scaling (AS), to pass the context information to each following layer, as depicted by the block “LF/AS” in Fig. 2. Accordingly, the model variants are named TNet-LF and TNet-AS.
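The first strategy, Lossless Forwarding, preserves context information by directly feeding the features before the transformation to the next layer. Specifically, the input $h_i^{(l+1)}$ of the $(l+1)$-th CPT layer is formulated as:

$$h_i^{(l+1)} = h_i^{(l)} + \tilde{h}_i^{(l)}, \quad i \in [1, n], \tag{6}$$

where $h_i^{(l)}$ is the input of the $l$-th layer and $\tilde{h}_i^{(l)}$ is the output of TST in this layer. We unfold the recursive form of Eq. 6 as follows:

$$h_i^{(l+1)} = h_i^{(0)} + \mathrm{TST}(h_i^{(0)}) + \mathrm{TST}(h_i^{(1)}) + \cdots + \mathrm{TST}(h_i^{(l)}). \tag{7}$$

Here, we denote $\tilde{h}_i^{(l)}$ as $\mathrm{TST}(h_i^{(l)})$. From Eq. 7, we can see that the output of each layer will contain the contextualized word representations (i.e., $h_i^{(0)}$), thus, the context information is encoded into the transformed features. We call this strategy “Lossless Forwarding” because the contextualized representations and the transformed representations (i.e., $\mathrm{TST}(h_i^{(l)})$) are kept unchanged during the feature combination.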
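Lossless Forwarding introduces the context information by directly adding back the contextualized features to the transformed features, which raises a question: Can the weights of the input and the transformed features be adjusted dynamically? With this motivation, we propose another strategy, named “Adaptive Scaling”. Similar to the gate mechanism in RNN variants Jozefowicz et al. (2015), Adaptive Scaling introduces a gating function to control the passed proportions of the transformed features and the input features. The gate is computed as follows:

$$t_i^{(l)} = \sigma\big(W_{trans} h_i^{(l)} + b_{trans}\big), \tag{8}$$

where $t_i^{(l)}$ is the gate for the $i$-th input of the $l$-th CPT layer, and $\sigma$ is the sigmoid activation function. Then we perform a convex combination of $h_i^{(l)}$ and $\tilde{h}_i^{(l)}$ based on the gate:

$$h_i^{(l+1)} = t_i^{(l)} \odot \tilde{h}_i^{(l)} + (1 - t_i^{(l)}) \odot h_i^{(l)}. \tag{9}$$

Here, $\odot$ denotes element-wise multiplication. The non-recursive form of this equation is as follows (for clarity, we ignore the subscripts; products are element-wise and an empty product equals 1):

$$h^{(l+1)} = \Big[\prod_{k=0}^{l} (1 - t^{(k)})\Big] \odot h^{(0)} + \sum_{j=0}^{l} \Big[t^{(j)} \prod_{k=j+1}^{l} (1 - t^{(k)})\Big] \odot \tilde{h}^{(j)}.$$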
Thus, the context information is integrated in each upper layer and the proportions of the contextualized representations and the transformed representations are controlled by the computed gates in different transformation layers.
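The two context-preserving strategies differ only in how the layer input and the TST output are combined. The sketch below illustrates both for a single word vector. It assumes the gate is a sigmoid of an affine function of the layer input; the names `W_gate` and `b_gate` and all numeric values are hypothetical.

```python
import math

def lossless_forwarding(h, h_tilde):
    """LF: add the untransformed input back to the TST output."""
    return [a + b for a, b in zip(h, h_tilde)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_scaling(h, h_tilde, W_gate, b_gate):
    """AS: gated convex combination of the TST output and the layer input.

    Assumption: the gate is sigmoid(W_gate @ h + b_gate), one value per feature.
    """
    gate = [sigmoid(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(W_gate, b_gate)]
    out = [t * ht + (1.0 - t) * x for t, ht, x in zip(gate, h_tilde, h)]
    return out, gate

# Hypothetical toy values.
h = [1.0, -2.0]          # input of the current CPT layer
h_tilde = [0.5, 0.5]     # output of TST in this layer
lf_out = lossless_forwarding(h, h_tilde)

# With all-zero gate parameters every gate value is 0.5,
# so AS reduces to the plain average of h and h_tilde.
as_out, gate = adaptive_scaling(h, h_tilde, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

LF has no extra parameters, whereas AS learns the gate weights, which is one possible reason it is harder to train as the number of layers grows.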
Recall that the second issue that prevents CNN from performing well is that vanilla CNN may associate a target with unrelated general opinion words which are frequently used as modifiers for different targets across domains. For example, “service” in “Great food but the service is dreadful” may be associated with both “great” and “dreadful”. To solve it, we adopt a proximity strategy, which has been observed to be effective in Chen et al. (2017); Li and Lam (2017). The idea is that a closer opinion word is more likely to be the actual modifier of the target.
Table 1: Statistics of the datasets (columns: # Positive, # Negative, # Neutral).
Specifically, we first calculate the position relevance $v_i$ between the $i$-th word and the target (as we perform sentence padding, the index $i$ may be larger than the actual length $n$ of the sentence):

$$v_i = \begin{cases} 1 - \dfrac{k + m - i}{C} & i < k + m \\ 1 - \dfrac{i - k}{C} & k + m \le i \le n \\ 0 & i > n \end{cases} \tag{10}$$

where $k$ is the index of the first target word, $C$ is a pre-specified constant, and $m$ is the length of the target $\mathbf{w}^\tau$. Then, we use $v_i$ to help CNN locate the correct opinion w.r.t. the given target:

$$\hat{h}_i^{(l)} = h_i^{(l)} \cdot v_i, \quad i \in [1, n], \; l \in [1, L]. \tag{11}$$
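To make the proximity idea concrete, the sketch below computes one piecewise-linear position weight per token, using 1-based indices. The exact piecewise form is an assumption consistent with the description (weights decay linearly with distance from the target, scaled by a constant, and padded positions get zero); the value `C = 40.0` and the padded length of 9 are hypothetical.

```python
def position_relevance(i, k, m, n, C):
    """Proximity weight of the i-th token (1-based).

    k: index of the first target word, m: target length,
    n: true sentence length, C: pre-specified constant.
    Assumed piecewise-linear form; padded positions (i > n) get 0.
    """
    if i > n:
        return 0.0
    if i < k + m:
        return 1.0 - (k + m - i) / C
    return 1.0 - (i - k) / C

# "great food but the service was dreadful", target "service" (k=5, m=1),
# true length n=7, padded to 9 positions.
tokens = ["great", "food", "but", "the", "service", "was", "dreadful"]
weights = [position_relevance(i, k=5, m=1, n=len(tokens), C=40.0)
           for i in range(1, 10)]
```

In this toy sentence “dreadful” sits closer to “service” than “great” does, so it receives the larger weight, and the two padding positions are zeroed out.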
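Based on Eq. 10 and Eq. 11, the words close to the target will be highlighted and those far away will be downgraded. $v_i$ is also applied on the intermediate output to introduce the position information into each CPT layer. Then we feed the weighted $\hat{\mathbf{h}}^{(L)}$ to the convolutional layer, i.e., the top-most layer in Fig. 1, to generate the feature map $\mathbf{c}$ as follows:

$$c_i = \mathrm{ReLU}\big(\mathbf{w}_{conv}^\top \hat{\mathbf{h}}_{i:i+s-1}^{(L)} + b_{conv}\big), \tag{12}$$

where $\hat{\mathbf{h}}_{i:i+s-1}^{(L)}$ is the concatenated vector of $\hat{h}_i^{(L)}, \ldots, \hat{h}_{i+s-1}^{(L)}$, and $s$ is the kernel size. $\mathbf{w}_{conv}$ and $b_{conv}$ are learnable weights of the convolutional kernel. To capture the most informative features, we apply max pooling Kim (2014) and obtain the sentence representation $\mathbf{z}$ by employing $n_k$ kernels:

$$\mathbf{z} = [\max(\mathbf{c}_1), \ldots, \max(\mathbf{c}_{n_k})]^\top. \tag{13}$$

Finally, we pass $\mathbf{z}$ to a fully connected layer for sentiment prediction:

$$p(y \mid \mathbf{w}^\tau, \mathbf{w}) = \mathrm{Softmax}(W_f \mathbf{z} + b_f), \tag{14}$$

where $W_f$ and $b_f$ are learnable parameters.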
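The convolution, max pooling and prediction steps can be sketched as follows. This is an illustrative toy, not the released model: it assumes a ReLU activation in the convolution, and the word vectors `H`, the two kernels and the output weights `W_f`, `b_f` are hypothetical.

```python
import math

def conv_feature_map(H, w, b, s):
    """Slide one kernel of size s over the word vectors H (ReLU assumed).

    H: list of n word vectors, w: flat kernel of length s * dim.
    Returns n - s + 1 activations, one per n-gram window.
    """
    out = []
    for i in range(len(H) - s + 1):
        window = [x for vec in H[i:i + s] for x in vec]  # concatenate s vectors
        out.append(max(0.0, sum(wk * xk for wk, xk in zip(w, window)) + b))
    return out

def sentence_representation(H, kernels, s):
    """Max-pool each kernel's feature map into one scalar."""
    return [max(conv_feature_map(H, w, b, s)) for w, b in kernels]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(z, W_f, b_f):
    """Fully connected layer followed by softmax over the sentiment classes."""
    logits = [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(W_f, b_f)]
    return softmax(logits)

# Hypothetical toy input: 4 position-weighted word vectors of dimension 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [([1.0] * 6, 0.0), ([-1.0] * 6, 1.0)]      # two size-3 kernels
z = sentence_representation(H, kernels, s=3)
probs = predict(z, W_f=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], b_f=[0.0, 0.0, 0.0])
```

Each entry of `z` comes from exactly one n-gram window, so the classifier sees only the strongest local feature per kernel instead of a weighted sum over all words.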
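Table 2: Settings of hyper-parameters.
|Hyper-parameter||TNet-LF||TNet-AS|
|dropout rates (LSTM input embeddings, sentence representation)||(0.3, 0.3)||(0.3, 0.3)|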
As shown in Table 1, we evaluate the proposed TNet on three benchmark datasets. LAPTOP and REST are from the SemEval ABSA challenge Pontiki et al. (2014), containing user reviews in the laptop domain and the restaurant domain respectively; we remove a few examples having the “conflict label”, as done in Chen et al. (2017). TWITTER is built by Dong et al. (2014), containing Twitter posts. All tokens are lowercased without removal of stop words, symbols or digits, and sentences are zero-padded to the length of the longest sentence in the dataset. Evaluation metrics are Accuracy and Macro-Averaged F1, where the latter is more appropriate for datasets with unbalanced classes. We also conduct pairwise t-tests on both Accuracy and Macro-Averaged F1 to verify whether the improvements over the compared models are reliable.
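The difference between the two metrics on unbalanced data can be seen with a small sketch. The gold labels and predictions below are hypothetical; the function computes the standard macro-averaged F1, i.e., the unweighted mean of per-class F1 scores.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

def accuracy(gold, pred):
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)

# Hypothetical unbalanced sample: a classifier that always predicts "P".
gold = ["P", "P", "P", "P", "N", "O"]
pred = ["P", "P", "P", "P", "P", "P"]
acc = accuracy(gold, pred)
mf1 = macro_f1(gold, pred, labels=["P", "N", "O"])
```

Here the majority-class predictor still reaches an accuracy of 4/6, while its macro-averaged F1 is much lower because the two minority classes each contribute an F1 of zero.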
TNet is compared with the following methods.
AdaRNN Dong et al. (2014): It learns the sentence representation toward target for sentiment prediction via semantic composition over dependency tree;
AE-LSTM, and ATAE-LSTM Wang et al. (2016): AE-LSTM is a simple LSTM model incorporating the target embedding as input, while ATAE-LSTM extends AE-LSTM with attention;
IAN Ma et al. (2017): IAN employs two LSTMs to learn the representations of the context and the target phrase interactively;
CNN-ASP: It is a CNN-based model implemented by us which directly concatenates target representation to each word embedding;
TD-LSTM Tang et al. (2016a): It employs two LSTMs to model the left and right contexts of the target separately, then performs predictions based on concatenated context representations;
MemNet Tang et al. (2016b): It applies attention mechanism over the word embeddings multiple times and predicts sentiments based on the top-most sentence representations;
BILSTM-ATT-G Liu and Zhang (2017): It models left and right contexts using two attention-based LSTMs and introduces gates to measure the importance of left context, right context, and the entire sentence for the prediction;
RAM Chen et al. (2017): RAM is a multi-layer architecture where each layer consists of attention-based aggregation of word features and a GRU cell to learn the sentence representation.
We run the released code of TD-LSTM and BILSTM-ATT-G to generate results, since their papers only reported results on TWITTER. We also rerun MemNet on our datasets and evaluate it with both accuracy and Macro-Averaged F1, since MemNet was originally evaluated only with accuracy. (The code of TD-LSTM/MemNet and BILSTM-ATT-G is available at http://ir.hit.edu.cn/~dytang and http://leoncrashcode.github.io.)
We use pre-trained GloVe vectors Pennington et al. (2014) to initialize the word embeddings and the dimension is 300 (i.e., $dim_w = 300$). For out-of-vocabulary words, we randomly sample their embeddings from the uniform distribution, as done in Kim (2014). We only use one convolutional kernel size because it was observed that a CNN with a single optimal kernel size is comparable with a CNN having multiple kernel sizes on small datasets Zhang and Wallace (2017). To alleviate overfitting, we apply dropout on the input word embeddings of the LSTM and the ultimate sentence representation $\mathbf{z}$. All weight matrices are initialized with the uniform distribution and the biases are initialized as zeros. The training objective is cross-entropy, and Adam Kingma and Ba (2015) is adopted as the optimizer, following the learning rate and the decay rates in the original paper.
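Table 3 (ablation rows): Accuracy (ACC) and Macro-Averaged F1 (%) of the ablated TNet variants.
|Group||Model||LAPTOP ACC||LAPTOP Macro-F1||REST ACC||REST Macro-F1||TWITTER ACC||TWITTER Macro-F1|
|Ablated TNet||TNet w/o transformation||73.30||68.25||78.90||65.86||72.10||70.57|
|Ablated TNet||TNet w/o context||73.91||68.87||80.07||69.01||74.51||73.05|
|Ablated TNet||TNet-LF w/o position||75.13||70.63||79.86||69.69||73.83||72.49|
|Ablated TNet||TNet-AS w/o position||75.27||70.03||79.79||69.78||73.84||72.47|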
The hyper-parameters of TNet-LF and TNet-AS are listed in Table 2. Specifically, all hyper-parameters are tuned on 20% randomly held-out training data and the hyper-parameter setting producing the highest accuracy score is used for testing. Our model has a number of parameters comparable to that of traditional LSTM-based models, as we reuse parameters in the transformation layers and BiLSTM. (All experiments are conducted on a single NVIDIA GTX 1080. The prediction cost of a sentence is about 2 ms.)
As shown in Table 3, both TNet-LF and TNet-AS consistently achieve the best performance on all datasets, which verifies the efficacy of our whole TNet model. Moreover, TNet can perform well for different kinds of user generated content, such as product reviews with relatively formal sentences in LAPTOP and REST, and tweets with more ungrammatical sentences in TWITTER. The reason is the CNN-based feature extractor arms TNet with more power to extract accurate features from ungrammatical sentences. Indeed, we can also observe that another CNN-based baseline, i.e., CNN-ASP implemented by us, also obtains good results on TWITTER.
On the other hand, the performance of those comparison methods is mostly unstable. For the tweets in TWITTER, the competitive BILSTM-ATT-G and RAM cannot perform as effectively as they do for the reviews in LAPTOP and REST, because they are heavily rooted in LSTMs and the ungrammatical sentences hinder their capability in capturing the context features. Another difficulty caused by the ungrammatical sentences is that the dependency parsing might be error-prone, which will affect methods such as AdaRNN that use dependency information.
From the above observations and analysis, some takeaway messages for the task of target sentiment classification could be:
LSTM-based models relying on sequential information can perform well for formal sentences by capturing more useful context features;
For ungrammatical text, CNN-based models may have some advantages because CNN aims to extract the most informative n-gram features and is thus less sensitive to informal texts without strong sequential patterns.
To investigate the impact of each component such as deep transformation, context-preserving mechanism, and positional relevance, we compare the full TNet models with their ablations (the third group in Table 3). After removing the deep transformation (i.e., the techniques introduced in Section 2.2), both TNet-LF and TNet-AS are reduced to TNet w/o transformation (where position relevance is kept), and their results in both accuracy and F1 measure fall well short of those of TNet. It shows that the integration of target information into the word-level representations is crucial for good performance.
Comparing the results of TNet and TNet w/o context (where TST and position relevance are kept), we observe that the performance of TNet w/o context drops significantly on LAPTOP and REST (unless otherwise specified, the significance level is set to 0.05), while on TWITTER, TNet w/o context performs very competitively ($p$-values with TNet-LF and TNet-AS are 0.066 and 0.053 respectively for Accuracy). Again, we could attribute this phenomenon to the ungrammatical user generated content of Twitter, because the context-preserving component becomes less important for such data. TNet w/o context performs consistently better than TNet w/o transformation, which verifies the efficacy of the Target-Specific Transformation (TST) before applying context-preserving.
As for the position information, we conduct statistical t-tests between TNet-LF/AS and TNet-LF/AS w/o position together with the performance comparison. All of the produced $p$-values are less than 0.05, suggesting that the improvements brought in by position information are significant.
The next interesting question is: what if we replace the transformation module (i.e., the CPT layers in Fig. 1) of TNet with other commonly used components? We investigate two alternatives, the attention mechanism and the fully-connected (FC) layer, resulting in three pipelines as shown in the second group of Table 3 (position relevance is kept for them).
LSTM-ATT-CNN applies attention as the alternative (we tried different attention mechanisms and report the best one here, namely, dot attention Luong et al. (2015)), and it does not need the context-preserving mechanism. It performs worse than the TNet variants without exception. We are surprised that LSTM-ATT-CNN is even worse than TNet w/o transformation (a pipeline simply removing the transformation module) on TWITTER. More concretely, applying attention has a negative effect on TWITTER, which is consistent with the observation that all those attention-based state-of-the-art methods (i.e., TD-LSTM, MemNet, BILSTM-ATT-G, and RAM) cannot perform well on TWITTER.
LSTM-FC-CNN-LF and LSTM-FC-CNN-AS are built by applying an FC layer to replace TST and keeping the context-preserving mechanism (i.e., LF and AS). Specifically, the concatenation of the word representation and the averaged target vector is fed to the FC layer to obtain target-specific features. Note that LSTM-FC-CNN-LF/AS are equivalent to TNet-LF/AS when processing single-word targets (see Eq. 3). They obtain competitive results on all datasets: comparable with or better than the state-of-the-art methods. The TNet variants can still outperform LSTM-FC-CNN-LF/AS with significant gaps, e.g., on LAPTOP and REST, the accuracy gaps between TNet-LF and LSTM-FC-CNN-LF are 0.42% ($p < 0.03$) and 0.38% ($p < 0.04$) respectively.
As our TNet involves multiple CPT layers, we investigate the effect of the layer number $L$. Specifically, we conduct experiments on the held-out training data of LAPTOP and vary $L$ from 2 to 10, increased by 2. The cases $L=1$ and $L=15$ are also included. The results are illustrated in Figure 3. We can see that both TNet-LF and TNet-AS achieve the best results when $L=2$. As $L$ increases, the performance generally becomes worse. For large $L$, the performance of TNet-AS generally becomes more sensitive, probably because AS involves extra parameters (see Eq. 9) that increase the training difficulty.
|Sentence||BILSTM-ATT-G||RAM||TNet-LF||TNet-AS|
|1. Air has higher but the are small .||(N, N)||(N, N)||(P, N)||(P, N)|
|2. Great but the is dreadful .||(P, N)||(P, N)||(P, N)||(P, N)|
|3. Sure it ’ s not light and slim but the make up for it 100% .||N||N||P||P|
|4. Not only did they have amazing , , , etc , but their are out of this world !||(P, O, O, P)||(P, P, O, P)||(P, P, P, P)||(P, P, P, P)|
|5. are incredibly long : over two minutes .||P||P||N||N|
|6. I am pleased with the fast , speedy and the long ( 6 hrs ) .||(P, P, P)||(P, P, P)||(P, P, P)||(P, P, P)|
|7. The should be a bit more friendly .||P||P||P||P|
Table 4 shows some sample cases. The input targets are wrapped in brackets with true labels given as subscripts. The notations P, N and O in the table represent positive, negative and neutral respectively. For each sentence, we underline the target with a particular color, and the text of its corresponding most informative n-gram feature captured by TNet-AS (TNet-LF captures very similar features) is in the same color (so color printing is preferred). (For each convolutional filter, only one n-gram feature in the feature map will be kept after the max pooling. Among those from different filters, the n-gram with the highest frequency will be regarded as the most informative n-gram w.r.t. the given target.) For example, for the target “resolution” in the first sentence, the captured feature is “Air has higher”. Note that as discussed above, the CNN layer of TNet captures such features with size-three kernels, so the features are trigrams. Each of the last features of the second and seventh sentences contains a padding token, which is not shown.
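The selection of the most informative n-gram described above can be sketched as follows: each filter votes for the window where its feature map peaks, and the window with the most votes wins. The feature-map values are hypothetical, and tie-breaking (first window reaching the top count) is an assumption of this sketch.

```python
from collections import Counter

def most_informative_ngram(tokens, feature_maps, s):
    """Return the n-gram selected by max pooling most often across filters.

    feature_maps: one list of activations per filter, each of length
    len(tokens) - s + 1 (one activation per n-gram window).
    """
    votes = Counter()
    for fmap in feature_maps:
        # Position kept by max pooling for this filter.
        start = max(range(len(fmap)), key=fmap.__getitem__)
        votes[start] += 1
    best_start, _ = votes.most_common(1)[0]
    return tokens[best_start:best_start + s]

# Hypothetical activations of three size-3 filters over a 5-token prefix.
tokens = ["air", "has", "higher", "resolution", "but"]
feature_maps = [
    [0.9, 0.1, 0.0],
    [0.7, 0.2, 0.3],
    [0.1, 0.8, 0.2],
]
ngram = most_informative_ngram(tokens, feature_maps, s=3)
```

With these toy activations two of the three filters peak on the first window, so the selected trigram is “air has higher”.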
Our TNet variants can predict target sentiment more accurately than RAM and BILSTM-ATT-G in transitional sentences such as the first sentence by capturing the correct trigram features. For the third sentence, the second and third most informative trigrams are “100% . PAD” and “’ s not”; used together with “features make up”, they allow our models to make correct predictions. Moreover, TNet can still make correct predictions when the explicit opinion is target-specific. For example, “long” in the fifth sentence is negative for “startup time”, while it could be positive for other targets such as “battery life” in the sixth sentence. The sentiment of a target-specific opinion word is conditioned on the given target. Our TNet variants, armed with the word-level feature transformation w.r.t. the target, are capable of handling such cases.
We also find that all these models cannot give correct prediction for the last sentence, a commonly used subjunctive style. In this case, the difficulty of prediction does not come from the detection of explicit opinion words but the inference based on implicit semantics, which is still quite challenging for neural network models.
Aspect/target level sentiment classification is an important research topic in the field of sentiment analysis. The early methods mostly adopted supervised learning approaches with extensive hand-coded features Blair-Goldensohn et al. (2008); Titov and McDonald (2008); Yu et al. (2011); Jiang et al. (2011); Kiritchenko et al. (2014); Wagner et al. (2014); Vo and Zhang (2015), and they fail to model the semantic relatedness between a target and its context, which is critical for target sentiment analysis. Dong et al. (2014) incorporate the target information into the feature learning using dependency trees. As observed in previous works, the performance heavily relies on the quality of dependency parsing. Tang et al. (2016a) propose to split the context into two parts and associate the target with contextual features separately. Similar to Tang et al. (2016a), Zhang et al. (2016) develop a three-way gated neural network to model the interaction between the target and its surrounding contexts. Despite the advantages of jointly modeling target and context, they are not capable of capturing long-range information when some critical context information is far from the target. To overcome this limitation, researchers bring in the attention mechanism to model the target-context association Tang et al. (2016a, b); Wang et al. (2016); Yang et al. (2017); Liu and Zhang (2017); Ma et al. (2017); Chen et al. (2017); Zhang et al. (2017); Tay et al. (2017). Compared with these methods, our TNet avoids using attention for feature extraction so as to alleviate the attended noise.
We re-examine the drawbacks of the attention mechanism for target sentiment classification, and also investigate the obstacles that hinder CNN-based models from performing well on this task. Our TNet model is carefully designed to solve these issues. Specifically, we propose a Target-Specific Transformation component to better integrate target information into the word representation. Moreover, we employ CNN as the feature extractor for this classification problem, and rely on the context-preserving and position relevance mechanisms to maintain the advantages of previous LSTM-based models. The performance of TNet consistently dominates previous state-of-the-art methods on different types of data. The ablation studies show the efficacy of its different modules, and thus verify the rationality of TNet’s architecture.