Attention-based Aspect-term Sentiment Analysis implemented in TensorFlow.
Target-dependent sentiment classification remains a challenge: modeling the semantic relatedness of a target with its context words in a sentence. Different context words have different influences on determining the sentiment polarity of a sentence towards the target. Therefore, it is desirable to integrate the connections between target word and context words when building a learning system. In this paper, we develop two target-dependent long short-term memory (LSTM) models, where target information is automatically taken into account. We evaluate our methods on a benchmark dataset from Twitter. Empirical results show that modeling sentence representation with standard LSTM does not perform well. Incorporating target information into LSTM can significantly boost the classification accuracy. The target-dependent LSTM models achieve state-of-the-art performance without using a syntactic parser or external sentiment lexicons.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
Sentiment analysis is a fundamental task in natural language processing and computational linguistics. It is crucial to understanding user-generated text in social networks or product reviews, and has drawn much attention from both industry and academia. In this paper, we focus on target-dependent sentiment classification [Jiang et al.2011, Dong et al.2014, Vo and Zhang2015], which is a fundamental and extensively studied task in the field of sentiment analysis. Given a sentence and a target mention, the task calls for inferring the sentiment polarity (e.g. positive, negative, neutral) of the sentence towards the target. For example, let us consider the sentence: “I bought a new camera. The picture quality is amazing but the battery life is too short”. If the target string is picture quality, the expected sentiment polarity is “positive” as the sentence expresses a positive opinion towards picture quality. If we consider the target as battery life, the correct sentiment polarity should be “negative”.
Target-dependent sentiment classification is typically regarded as a kind of text classification problem in the literature. The majority of existing studies build sentiment classifiers with supervised machine learning approaches, such as feature-based Support Vector Machines [Jiang et al.2011] or neural network approaches [Dong et al.2014, Vo and Zhang2015]. Despite the effectiveness of these approaches, we argue that target-dependent sentiment classification remains a challenge: how to effectively model the semantic relatedness of a target word with its context words in a sentence. One straightforward way to address this problem is to manually design a set of target-dependent features and integrate them into an existing feature-based SVM. However, feature engineering is labor intensive, and “sparse” and “discrete” features are clumsy in encoding side information like target-context relatedness. In addition, a person asked to do this task will naturally “look at” the relevant context words that help determine the sentiment polarity of a sentence towards the target. These observations motivate us to develop a powerful neural network approach, which is capable of learning continuous features (representations) without feature engineering while capturing the intricate relatedness between target and context words.
In this paper, we present neural network models to deal with target-dependent sentiment classification. The approach is an extension of long short-term memory (LSTM) [Hochreiter and Schmidhuber1997] that incorporates target information. Such a target-dependent LSTM approach models the relatedness of a target word with its context words, and selects the relevant parts of contexts to infer the sentiment polarity towards the target. The model can be trained in an end-to-end way with standard backpropagation, where the loss function is the cross-entropy error of supervised sentiment classification.
We apply the neural model to target-dependent sentiment classification on a benchmark dataset [Dong et al.2014]. We compare with feature-based SVM [Jiang et al.2011], adaptive recursive neural network [Dong et al.2014] and lexicon-enhanced neural network [Vo and Zhang2015]. Empirical results show that the proposed approach obtains state-of-the-art classification accuracy without using a syntactic parser or an external sentiment lexicon. In addition, we find that modeling a sentence with a standard LSTM does not perform well on this target-dependent task. Integrating target information into LSTM can significantly improve the classification accuracy.
We describe the proposed approach for target-dependent sentiment classification in this section. We first present a basic long short-term memory (LSTM) approach, which models the semantic representation of a sentence without considering the target word being evaluated. Afterwards, we extend LSTM by considering the target word, obtaining the Target-Dependent Long Short-Term Memory (TD-LSTM) model. Finally, we extend TD-LSTM with target connection, where the semantic relatedness of the target with its context words is incorporated.
In this part, we describe a long short-term memory (LSTM) model for target-dependent sentiment classification. It is a basic version of our approach. In this setting, the target to be evaluated is ignored so that the task is considered in a target independent way.
We use LSTM as it is a state-of-the-art performer for semantic composition in the area of sentiment analysis [Li et al.2015a, Tang et al.2015]. It is capable of computing the representation of a longer expression (e.g. a sentence) from the representation of its children with multi levels of abstraction. The sentence representation can be naturally considered as the feature to predict the sentiment polarity of sentence.
Specifically, each word is represented as a low-dimensional, continuous and real-valued vector, also known as a word embedding [Bengio et al.2003, Mikolov et al.2013, Pennington et al.2014, Tang et al.2014]. All the word vectors are stacked in a word embedding matrix $L_w \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of the word vector and $|V|$ is the vocabulary size. In this work, we pre-train the values of word vectors from a text corpus with embedding learning algorithms [Pennington et al.2014, Tang et al.2014] to make better use of semantic and grammatical associations of words.
We use LSTM to compute the vector of a sentence from the vectors of the words it contains; an illustration of the model is shown in Figure 1. LSTM is a kind of recurrent neural network (RNN), which is capable of mapping a variable-length sequence of word vectors to a fixed-length vector by recursively transforming the current word vector $w_t$ with the output vector of the previous step $h_{t-1}$. The transition function of a standard RNN is a linear layer followed by a pointwise non-linear layer such as the hyperbolic tangent function ($\tanh$).
$$h_t = \tanh(W \cdot [h_{t-1}; w_t] + b)$$
where $W \in \mathbb{R}^{d \times 2d}$, $b \in \mathbb{R}^{d}$, and $d$ is the dimension of the word vector. However, a standard RNN suffers from the problem of vanishing or exploding gradients [Bengio et al.1994, Hochreiter and Schmidhuber1997], where gradients may grow or decay exponentially over long sequences. Many researchers use a more sophisticated and powerful LSTM cell as the transition function, so that long-distance semantic correlations in a sequence can be better modeled. Compared with a standard RNN, the LSTM cell contains three additional neural gates: an input gate, a forget gate and an output gate. These gates adaptively remember the input vector, forget previous history and generate the output vector [Hochreiter and Schmidhuber1997]. The LSTM cell is calculated as follows.
$$i_t = \sigma(W_i \cdot [h_{t-1}; w_t] + b_i)$$
$$f_t = \sigma(W_f \cdot [h_{t-1}; w_t] + b_f)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}; w_t] + b_o)$$
$$g_t = \tanh(W_r \cdot [h_{t-1}; w_t] + b_r)$$
$$c_t = i_t \odot g_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$
where $\odot$ stands for element-wise multiplication, $\sigma$ is the sigmoid function, and $W_i$, $b_i$, $W_f$, $b_f$, $W_o$, $b_o$ are the parameters of the input, forget and output gates.
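The transition described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard LSTM formulation with three gates and a tanh candidate vector, and the helper names (`init_params`, `lstm_step`, `encode`), the initialization scale and the zero initial state are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Return W @ x + b, with W stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def init_params(d, seed=0, scale=0.1):
    """Hypothetical initializer: one (W, b) pair per gate, W of shape d x 2d."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-scale, scale) for _ in range(2 * d)]
                for _ in range(d)]

    # "i", "f", "o" are the input, forget and output gates; "r" is the candidate.
    return {name: (mat(), [0.0] * d) for name in ("i", "f", "o", "r")}


def lstm_step(params, w_t, h_prev, c_prev):
    """One LSTM transition on the concatenated input [h_prev; w_t]."""
    x = h_prev + w_t  # list concatenation, length 2d
    i = [sigmoid(v) for v in affine(*params["i"], x)]
    f = [sigmoid(v) for v in affine(*params["f"], x)]
    o = [sigmoid(v) for v in affine(*params["o"], x)]
    g = [math.tanh(v) for v in affine(*params["r"], x)]
    c = [i_k * g_k + f_k * c_k for i_k, g_k, f_k, c_k in zip(i, g, f, c_prev)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


def encode(params, word_vectors, d):
    """Run the cell over a word-vector sequence; return the last hidden vector."""
    h, c = [0.0] * d, [0.0] * d
    for w_t in word_vectors:
        h, c = lstm_step(params, w_t, h, c)
    return h
```

Calling `encode(init_params(3), vectors, 3)` on a list of 3-dimensional word vectors returns a 3-dimensional vector regardless of sentence length, which is the fixed-length property the text relies on.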
After calculating the hidden vector of each position, we regard the last hidden vector as the sentence representation [Li et al.2015a, Tang et al.2015]. We feed it to a linear layer whose output length is the number of classes, and add a $softmax$ layer to output the probability of classifying the sentence as positive, negative or neutral. The softmax function is calculated as follows, where $C$ is the number of sentiment categories.
$$softmax_i = \frac{\exp(x_i)}{\sum_{i'=1}^{C} \exp(x_{i'})}$$
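As a small sketch of this output layer, the following computes class scores with a linear layer and normalizes them with softmax. The function names, the label order and the max-subtraction trick for numerical stability are assumptions made for illustration; they are not taken from the paper.

```python
import math


def softmax(scores):
    """Normalize scores to probabilities; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, b, h, labels=("negative", "neutral", "positive")):
    """Linear layer (one row of W per class) followed by softmax."""
    scores = [sum(w * x for w, x in zip(row, h)) + b_i
              for row, b_i in zip(W, b)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

Here `h` stands for the sentence representation, i.e. the last hidden vector, and `W` has one row per sentiment category.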
The aforementioned LSTM model solves target-dependent sentiment classification in a target-independent way. That is to say, the feature representation used for sentiment classification remains the same without considering the target words. Let us again take “I bought a new camera. The picture quality is amazing but the battery life is too short” as an example. The representations of this sentence with regard to picture quality and battery life are identical. This is evidently problematic as the sentiment polarity labels towards these two targets are different.
To take the target information into account, we make a slight modification to the aforementioned LSTM model and introduce a target-dependent LSTM (TD-LSTM) in this subsection. The basic idea is to model the preceding and following contexts surrounding the target string, so that contexts in both directions can be used as feature representations for sentiment classification. We believe that capturing such target-dependent context information can improve the accuracy of target-dependent sentiment classification.
Specifically, we use two LSTM neural networks, a left one LSTM$_L$ and a right one LSTM$_R$, to model the preceding and following contexts respectively. An illustration of the model is shown in Figure 1. The input of LSTM$_L$ is the preceding contexts plus the target string, and the input of LSTM$_R$ is the following contexts plus the target string. We run LSTM$_L$ from left to right, and run LSTM$_R$ from right to left. We favor this strategy as we believe that regarding the target string as the last unit can better utilize the semantics of the target string when using the composed representation for sentiment classification. Afterwards, we concatenate the last hidden vectors of LSTM$_L$ and LSTM$_R$, and feed them to a $softmax$ layer to classify the sentiment polarity label. One could also try averaging or summing the last hidden vectors of LSTM$_L$ and LSTM$_R$ as alternatives.
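The input construction for the two networks can be sketched as follows. This is an illustrative reading of the description above, assuming the target is given as a half-open token span `[target_start, target_end)`; the function names and the `mode` parameter are hypothetical.

```python
def split_for_td_lstm(tokens, target_start, target_end):
    """Build the two input sequences; the target is tokens[target_start:target_end]."""
    # Left network: preceding context plus target, read left to right,
    # so the target is the last unit it sees.
    left_input = tokens[:target_end]
    # Right network: following context plus target, read right to left,
    # so the sequence is reversed and again ends on the target.
    right_input = tokens[target_start:][::-1]
    return left_input, right_input


def combine(h_left, h_right, mode="concat"):
    """Merge the two last hidden vectors before the softmax layer."""
    if mode == "concat":
        return h_left + h_right
    if mode == "sum":
        return [a + b for a, b in zip(h_left, h_right)]
    if mode == "avg":
        return [(a + b) / 2.0 for a, b in zip(h_left, h_right)]
    raise ValueError("unknown mode: " + mode)
```

For the camera example with target "picture quality", the left sequence ends on "quality" and the reversed right sequence ends on "picture", so both networks finish on a target word.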
Compared with LSTM model, target-dependent LSTM (TD-LSTM) could make better use of the target information. However, we think TD-LSTM is still not good enough because it does not capture the interactions between target word and its contexts. Furthermore, a person asked to do target-dependent sentiment classification will select the relevant context words which are helpful to determine the sentiment polarity of a sentence towards the target.
Based on the consideration mentioned above, we go one step further and develop a target-connection long short-term memory (TC-LSTM). This model extends TD-LSTM by incorporating a target connection component, which explicitly utilizes the connections between the target word and each context word when composing the representation of a sentence.
An overview of TC-LSTM is illustrated in Figure 2. The input of TC-LSTM is a sentence consisting of $n$ words $\{w_1, w_2, \ldots, w_n\}$ and a target string $t$ that occurs in the sentence. We represent the target $t$ as $\{w_{l+1}, w_{l+2}, \ldots, w_{r-1}\}$ because a target could be a word sequence of variable length, such as “google” or “harry potter”. When processing a sentence, we split it into three components: target words, preceding context words and following context words. We obtain the target vector $v_{target}$ by averaging the vectors of the words it contains, which has been proven to be simple and effective in representing named entities [Socher et al.2013a, Sun et al.2015]. When computing the hidden vectors of preceding and following context words, we use two separate long short-term memory models, similar to the strategy used in TD-LSTM. The difference is that in TC-LSTM the input at each position is the concatenation of the word embedding and the target vector $v_{target}$, while in TD-LSTM the input at each position only includes the embedding of the current word. We believe that TC-LSTM can make better use of the connection between the target and each context word when building the representation of a sentence.
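The difference from TD-LSTM lies only in how the inputs are built, which the following sketch illustrates. It assumes the target is given as a half-open span of positions and that the left and right sequences are formed as in TD-LSTM; the function names are hypothetical.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def tc_lstm_inputs(word_vectors, target_start, target_end):
    """Append the averaged target vector to every word vector, then split."""
    v_target = average_vectors(word_vectors[target_start:target_end])
    # Each position now carries [word embedding; target vector].
    augmented = [w + v_target for w in word_vectors]
    left_input = augmented[:target_end]               # read left to right
    right_input = augmented[target_start:][::-1]      # read right to left
    return left_input, right_input, v_target
```

Because every input doubles in length, the LSTM weight matrices grow accordingly, which is consistent with the larger parameter count and longer training time reported for TC-LSTM.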
We train LSTM, TD-LSTM and TC-LSTM in an end-to-end way in a supervised learning framework. The loss function is the cross-entropy error of sentiment classification.
$$loss = -\sum_{s \in S} \sum_{c=1}^{C} P^{g}_{c}(s) \cdot \log(P_c(s))$$
where $S$ is the training data, $C$ is the number of sentiment categories, $s$ means a sentence, $P_c(s)$ is the probability of predicting $s$ as class $c$ given by the $softmax$ layer, and $P^{g}_{c}(s)$ indicates whether class $c$ is the correct sentiment category, whose value is 1 or 0. We take the derivative of the loss function with respect to all parameters through back-propagation, and update parameters with stochastic gradient descent.
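A minimal sketch of the loss and the parameter update follows, assuming gold labels are given as one-hot vectors and parameters and gradients as flat lists. The function names are hypothetical, and the gradients themselves are assumed to come from back-propagation, which is not shown.

```python
import math


def cross_entropy(pred_probs, gold_onehot):
    """Sum over sentences and classes of -gold * log(predicted probability)."""
    loss = 0.0
    for p_s, g_s in zip(pred_probs, gold_onehot):
        for p_c, g_c in zip(p_s, g_s):
            if g_c:  # only the gold class has a non-zero indicator
                loss -= g_c * math.log(p_c)
    return loss


def sgd_update(params, grads, lr=0.01):
    """One stochastic gradient descent step on a flat parameter list."""
    return [p - lr * g for p, g in zip(params, grads)]
```

Since the indicator is 1 only for the correct class, the double sum reduces to the negative log-probability of the gold label for each sentence.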
We apply the proposed method to target-dependent sentiment classification to evaluate its effectiveness. We describe experimental setting and empirical results in this section.
We conduct experiments in a supervised setting on a benchmark dataset [Dong et al.2014]. Each instance in the training/test set has a manually labeled sentiment polarity. The training set contains 6,248 sentences and the test set has 692 sentences. The percentages of positive, negative and neutral instances are 25%, 25% and 50% in both the training and test sets. We train the model on the training set, and evaluate the performance on the test set. Evaluation metrics are accuracy and macro-F1 score over positive, negative and neutral categories [Manning and Schütze1999, Jurafsky and Martin2000].
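For reference, the two evaluation metrics can be computed as in the sketch below. It assumes labels are encoded as -1, 0 and 1 for negative, neutral and positive, and it assigns an F1 of 0 to a class with no true positives; both choices are assumptions made for illustration.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=(-1, 0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)
```

Macro-F1 weights the three classes equally, so it is less dominated by the neutral class, which makes up half of the data, than accuracy is.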
We compare with several baseline methods, including:
In SVM-indep, an SVM classifier is built with target-independent features, such as unigrams, bigrams, punctuation, emoticons, hashtags, and the numbers of positive or negative words in the General Inquirer sentiment lexicon. In SVM-dep, target-dependent features [Jiang et al.2011] are also concatenated into the feature representation.
In Recursive NN, a standard recursive neural network is used for feature learning over a transformed target-dependent dependency tree [Dong et al.2014]. AdaRNN-w/oE, AdaRNN-w/E and AdaRNN-comb are different variations of the adaptive recursive neural network [Dong et al.2014], whose composition functions are adaptively selected according to the inputs.
In Target-dep, an SVM classifier is built based on rich target-independent and target-dependent features [Vo and Zhang2015]. In Target-dep+, sentiment lexicon features are further incorporated.
The neural models developed in this paper are abbreviated as LSTM, TD-LSTM and TC-LSTM, which are described in the previous section. We use 100-dimensional GloVe vectors learned from Twitter, initialize the parameters randomly from a uniform distribution, set the clipping threshold of the softmax layer to 200 and set the learning rate to 0.01.
Experimental results of baseline models and our methods are given in Table 1. Comparing SVM-indep with SVM-dep, we find that incorporating target information improves the classification accuracy of a basic SVM classifier. AdaRNN performs better than the feature-based SVM by making use of dependency parsing information and tree-structured semantic composition. Target-dep is a strong performer even without using lexicon features. It benefits from rich automatic features generated from word embeddings.
Among the LSTM-based models described in this paper, the basic LSTM approach performs worst. This is not surprising because this task requires understanding target-dependent text semantics, while the basic LSTM model does not capture any target information, so it predicts the same result for different targets in a sentence. TD-LSTM obtains a big improvement over LSTM when target signals are taken into consideration. This result demonstrates the importance of target information for target-dependent sentiment classification. By incorporating the target-connection mechanism, TC-LSTM obtains the best performance and outperforms all baseline methods in terms of classification accuracy.
Comparing Target-dep with Target-dep+, we find that sentiment lexicon features further improve the classification accuracy. Our final model TC-LSTM, without using sentiment lexicon information, performs comparably with Target-dep+. We believe that incorporating lexicon information into TC-LSTM could bring further improvement. We leave this as potential future work.
It is well accepted that a good word embedding is crucial to composing a powerful text representation at a higher level. We therefore study the effects of different word embeddings on LSTM, TD-LSTM and TC-LSTM in this part. Since the benchmark dataset from [Dong et al.2014] comes from Twitter, we compare sentiment-specific word embeddings (SSWE) [Tang et al.2014] (SSWE vectors are publicly available at http://ir.hit.edu.cn/~dytang) with GloVe vectors [Pennington et al.2014] (GloVe vectors are publicly available at http://nlp.stanford.edu/projects/glove/). All these word vectors are 50-dimensional and learned from Twitter. SSWE$_h$, SSWE$_r$ and SSWE$_u$ are different embedding learning algorithms introduced in [Tang et al.2014]. SSWE$_h$ and SSWE$_r$ learn word embeddings using only the sentiment of sentences. SSWE$_u$ takes into account the sentiment of sentences and the contexts of words simultaneously.
From Figure 3, we can find that SSWE$_h$ and SSWE$_r$ perform worse than SSWE$_u$, which is consistent with the results reported on target-independent sentiment classification of tweets [Tang et al.2014]. This shows the importance of context information for word embedding learning, as neither SSWE$_h$ nor SSWE$_r$ encodes any word contexts. GloVe and SSWE$_u$ perform comparably, which indicates the importance of global context for estimating a good word representation. In addition, the target connection model TC-LSTM performs best for each of the word embeddings considered.
We compare GloVe vectors with different dimensions (50/100/200). Classification accuracy and time cost are given in Figure 3 and Table 2, respectively. We can find that 100-dimensional word vectors perform better than 50-dimensional word vectors, while 200-dimensional word vectors do not show significant improvements. Furthermore, TD-LSTM and LSTM have similar time cost, while TD-LSTM gets higher classification accuracy as target information is incorporated. TC-LSTM performs slightly better than TD-LSTM, at the cost of longer training time because TC-LSTM has more parameters.
In this section, we explore to what extent the target-dependent LSTM models including TD-LSTM and TC-LSTM improve the performance of a basic LSTM model.
|Example||Gold||LSTM|
|i hate my ipod look at my last tweet before the argh one that ’s for you||-1||0|
|okay soooo … ummmmm …. what is going on with lindsay lohan ’s face? boring day at the office = perez and tomorrow overload. not good||0||-1|
|i heard ShannonBrown did his thing in the lakers game!! got ta love him||0||1|
|Hey google, thanks for all these great Labs features on Chromium, but how about “Create Application Shortcut”?!||1||0|
In Table 3, we list some examples whose polarity labels are incorrectly inferred by LSTM but correctly predicted by both TD-LSTM and TC-LSTM. We observe that the LSTM model tends to assign the polarity of the entire sentence while ignoring the target to be evaluated. TD-LSTM and TC-LSTM can take target information into account to some extent. For example, in the 2nd example the opinion holder expresses a negative opinion about his work, but holds a neutral sentiment towards the target “lindsay lohan”. In the last example, the whole sentence expresses a neutral sentiment while it holds a positive opinion towards “google”.
We analyse the error cases that neither TD-LSTM nor TC-LSTM handles well, and find that 85.4% of the misclassified examples relate to the neutral category. Positive instances are rarely misclassified as negative, and vice versa. An example of such errors is: “freaky friday on television reminding me to think wtf happened to lindsay lohan, she was such a terrific actress , + my huge crush on haley hudson.”, which is incorrectly predicted as positive towards the target “lindsay lohan” by both TD-LSTM and TC-LSTM.
In order to capture the semantic relatedness between target and context words, we extend TD-LSTM by adding a target connection component. One could also try other extensions to capture the connection between target and context words. For example, we also tried an attention-based LSTM model, which is inspired by the recent success of attention-based neural networks in machine translation [Bahdanau et al.2015] and document encoding [Li et al.2015b]. We implement the soft-attention mechanism [Bahdanau et al.2015] to enhance TD-LSTM. We incorporate two attention layers, one for the preceding LSTM and one for the following LSTM. The output vector of each attention layer is the weighted average of the hidden vectors of the LSTM, where the weight of each hidden vector is calculated with a feedforward neural network. The outputs of the preceding and following attention models are concatenated and fed to $softmax$ for sentiment classification. However, we cannot obtain better results with such an attention model. The accuracy of this attention model is slightly lower than that of the standard LSTM model (around 65%), which means that the attention component has a negative impact on the model. A potential reason might be that the attention-based LSTM has a larger number of parameters, which cannot be easily optimized with a small corpus.
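The attention layer described above can be sketched as a weighted average over hidden vectors. The text only states that the weights come from a feedforward network, so the particular scoring form used here, `v . tanh(W h + b)` followed by softmax normalization, is an assumption, as are all names in the code.

```python
import math


def attention_pool(hidden_vectors, W, b, v):
    """Weighted average of hidden vectors with feedforward-scored weights."""
    scores = []
    for h in hidden_vectors:
        # One-hidden-layer feedforward scorer (assumed form): v . tanh(W h + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
        scores.append(sum(v_k * u_k for v_k, u_k in zip(v, u)))
    # Softmax over positions so the weights are positive and sum to one.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_vectors[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, hidden_vectors))
              for k in range(dim)]
    return pooled, weights
```

The scorer adds the parameters `W`, `b` and `v` for each of the two attention layers, which is in line with the larger parameter count suggested above as a possible reason for the weaker result.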
We briefly review existing studies on target-dependent sentiment classification and neural network approaches for sentiment classification in this section.
Target-dependent sentiment classification is typically regarded as a kind of text classification problem in the literature. Therefore, standard text classification approaches such as feature-based Support Vector Machines [Pang et al.2002, Jiang et al.2011] can be naturally employed to build a sentiment classifier. Despite the effectiveness of feature engineering, it is labor intensive and unable to discover the discriminative or explanatory factors of data. To handle this problem, some recent studies [Dong et al.2014, Vo and Zhang2015] use neural network methods and encode each sentence in a continuous and low-dimensional vector space without feature engineering. Dong et al. [2014] transform the dependency tree of a sentence into a target-specific recursive structure, and obtain a higher-level representation based on that structure. Vo and Zhang [2015] use rich features including sentiment-specific word embeddings and sentiment lexicons. Different from previous studies, the LSTM models developed in this work are purely data-driven, and do not rely on dependency parsing results or external sentiment lexicons.
Neural network approaches have shown promising results on many sentence/document-level sentiment classification tasks [Socher et al.2013b, Tang et al.2015]. The power of a neural model lies in its ability to learn continuous text representations from data without any feature engineering. For sentence/document-level sentiment classification, previous studies mostly have two steps. They first learn continuous word vector embeddings from data [Bengio et al.2003, Mikolov et al.2013, Pennington et al.2014]. Afterwards, semantic compositional approaches are used to compute the vector of a sentence/document from the vectors of its constituents based on the principle of compositionality [Frege1892]. Representative compositional approaches to learn sentence representations include recursive neural networks [Socher et al.2013b, Irsoy and Cardie2014], convolutional neural networks [Kalchbrenner et al.2014, Kim2014], long short-term memory [Li et al.2015a] and tree-structured LSTM [Tai et al.2015, Zhu et al.2015]. There also exist some studies focusing on learning continuous representations of documents [Le and Mikolov2014, Tang et al.2015, Bhatia et al.2015, Yang et al.2016].
We develop target-specific long short-term memory models for target-dependent sentiment classification. The approach captures the connection between the target word and its contexts when generating the representation of a sentence. We train the model in an end-to-end way on a benchmark dataset, and show that incorporating target information can boost the performance of a long short-term memory model. The target-dependent LSTM model obtains state-of-the-art classification accuracy.
We greatly thank Yaming Sun for tremendously helpful discussions. This work was supported by the National High Technology Development 863 Program of China (No. 2015AA015407), National Natural Science Foundation of China (No. 61632011 and No.61273321). According to the meaning given to this role by Harbin Institute of Technology, the contact author of this paper is Bing Qin.
When are tree structures necessary for deep learning of representations? EMNLP.
A hierarchical neural autoencoder for paragraphs and documents. ACL.
Reasoning with neural tensor networks for knowledge base completion. NIPS.