Sentiment Classification with Word Attention based on Weakly Supervised Learning with a Convolutional Neural Network

09/28/2017
by   Gichang Lee, et al.

In order to maximize the applicability of sentiment analysis results, it is necessary not only to classify the overall sentiment (positive/negative) of a given document but also to identify the main words that contribute to the classification. However, most datasets for sentiment analysis only have a sentiment label for each document or sentence. In other words, there is no information about which words play an important role in sentiment classification. In this paper, we propose a method for identifying the key words that discriminate positive from negative sentences by using a weakly supervised learning method based on a convolutional neural network (CNN). In our model, each word is represented as a continuous-valued vector and each sentence is represented as a matrix whose rows correspond to the word vectors used in the sentence. Then, the CNN model is trained using these sentence matrices as inputs and the sentiment labels as the output. Once the CNN model is trained, we implement a word attention mechanism that identifies the words contributing most to the classification results with a class activation map, using the weights from the fully connected layer at the end of the learned CNN model. In order to verify the proposed methodology, we evaluated the classification accuracy and the inclusion rate of polarity words using two movie review datasets. Experimental results show that the proposed model can not only correctly classify the sentence polarity but also successfully identify the corresponding words with high polarity scores.


1 Introduction

Sentiment analysis and opinion mining is a field of study that analyzes people’s opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing (NLP) and has also been widely studied in data mining, Web mining, and text mining (Medhat et al., 2014; Liu, 2012; Pang et al., 2008; Ravi & Ravi, 2015). Application domains for sentiment analysis include analyses of customer response to new products or services, analyses of public opinion towards the government’s new policies or political issues under debate, etc. (Jo, 2012). In response to increasing needs in diverse domains, various sentiment analysis techniques have been developed (Gui et al., 2017; Cho et al., 2014; Poria et al., 2016; Xianghua et al., 2013; Socher et al., 2013; Kalchbrenner et al., 2014; Tai et al., 2015). However, many of the current sentiment analysis techniques suffer from the over-abstraction problem (Nasukawa & Yi, 2003); the only information obtained from these techniques is the polarity of the document, i.e., whether the nuance of the document is positive or negative. It is difficult to obtain more in-depth sentiment analysis results, such as identifying the main words contributing to the polarity classification or finding words or phrases that oppose the overall sentiment of the document, i.e., negative words/phrases in a positive document or positive words/phrases in a negative document.

Recently, attention models have been highlighted in the field of computer vision because of their ability to focus on semantically significant areas in a given image to solve the tasks of object classification, localization, and detection (Ba et al., 2014; Russakovsky et al., 2015; Mnih et al., 2014). They have also been widely adopted in the field of NLP, as attention models can provide more fruitful interpretations for text analysis tasks (Luong et al., 2015; Shen & Huang, 2016; Rush et al., 2015). Attention models help the NLP model focus on salient words/phrases and transfer these attentions to other machine learning models to solve more complicated tasks such as image captioning or text-to-image generation (Xu et al., 2015). In addition, as one of the basic building blocks of artificial intelligence (AI) is to understand a human speaker’s intention, global technology leaders have released their own AI speakers, such as Amazon’s “Echo,” Google’s “Google Home,” and Apple’s “HomePod,” to collect real-world conversational data in order to upgrade their AI engines. As these AI speakers process the human speaker’s query at a sentence level, it becomes more critical to correctly identify the main intentions (words/phrases) of the speaker, which is the ultimate goal of attention models.

It is not easy to implement an attention model in NLP tasks. This is mainly because most text datasets have only document-level labels, i.e., whether the overall nuance of the document is positive or negative, whereas phrase- or word-level sentiment labels are rarely available. The model must therefore learn attention scores for words or phrases without actual labels. To overcome this problem, previous studies modified the structure of a recurrent neural network (RNN) such that the added weights play an attention role inside the model. Applications of RNN-based attention models include document classification (Yang et al., 2016), parsing (Vinyals et al., 2015), machine translation (Bahdanau et al., 2014; Luong et al., 2015), and image captioning (Xu et al., 2015).

In this paper, we propose a sentiment classification model with word attention based on weakly supervised learning with a convolutional neural network (CNN), named CAM: Classification and Attention Model with a Class Activation Map. The main advantage of the proposed model is its ability to identify crucial words or phrases in a sentence from the sentiment classification perspective without explicit word- or phrase-level sentiment polarity information. It identifies these words using weak labels only, i.e., the sentence-level polarity, which is more abstract but easily available. In the proposed model, words are embedded in a fixed-size continuous vector space using Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2016). Sentences are represented in matrix form, with rows corresponding to word vectors, and these matrices are used as the input of a CNN model. The CNN model is trained with the sentence-level sentiment polarity as the target, and it produces both the sentence-level polarity score and word-level polarity scores for all words in the sentence, which helps us understand the result of sentence-level sentiment classification. Unlike the existing attention models based on RNNs, there is no need to separately learn the weights for the attention. Because the same word can be used in different contexts in different domains, the proposed model also makes it relatively easy to build a dictionary that reflects the characteristics of each domain.

The rest of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we describe the architecture of the proposed model. Detailed experimental settings are described in Section 4, followed by the analysis and discussion of the results. Finally, in Section 5 we present our conclusions.

2 Related Work

In this section, we briefly review the representative studies on CNN-based document classification (Kim, 2014), weakly supervised learning for CNN-based object detection (Oquab et al., 2015; Zhou et al., 2016), and the RNN-based document attention model named the hierarchical attention network (Yang et al., 2016).

2.1 Convolutional Neural Networks for Document Classification

Kim (2014) showed that the CNN, which is the most successful neural network structure for image processing, can also work well for text data, especially for document classification. The architecture of Kim (2014) is shown in Figure 1, and it is built on the following three main ideas:

  1. A large number of filters are used, but the network is not as deep as popular CNN architectures for image processing.

  2. The size of the CNN filter is matched with the vector size of input words.

  3. Multi-channels consisting of static and non-static input vectors are combined.

Experimental results show that the CNN-based document classification model achieved higher classification accuracies than the conventional machine learning-based models, such as the support vector machine or conditional random field, and other deep neural network structures, such as the deep feedforward neural network or recursive neural network. In addition, the word vector could also be customized for a given corpus, and it sometimes yielded better classification performance than pre-trained word vectors.

Figure 1: Model architecture with two channels for an example sentence (Kim, 2014).
Figure 2: Class activation mapping (Zhou et al., 2016).

2.2 Class Activation Mapping

Figure 3: Hierarchical Attention Network (Yang et al., 2016).

Oquab et al. (2015) proposed a weakly supervised learning method for object detection without bounding box information. In this study, a standard CNN architecture with max pooling between the final convolution and the output layer was utilized.

Zhou et al. (2016) showed that average pooling is more appropriate for the object detection task than max pooling. The CNN structure and an example of the attention mechanism are shown in Figure 2. In this model, the CNN is trained to correctly classify the object in the input image. In Figure 2, the target of the given image is “Australian terrier,” but no information on the dog’s position in the input image is available during the training. When the training is complete, the weights in the fully connected layer are used to combine the feature maps to emphasize the attention area of the original input image. They called this process class activation mapping (CAM). By utilizing it, the CNN model can not only determine that the “Australian terrier” is in the image but also show that this classification is mainly inferred from the bottom right part of the image (red area in the final CAM in Figure 2).
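As a rough illustration of the mechanism described above, the following NumPy sketch combines the feature maps of a final convolutional layer with the fully connected weights of one class. The array shapes, the random values, and the function name `class_activation_map` are assumptions made for illustration only; they are not taken from Zhou et al. (2016), and the sketch omits any bias term.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weighted sum of feature maps using the FC weights of one class.

    feature_maps: array of shape (n_maps, height, width) from the last conv layer.
    fc_weights:   array of shape (n_classes, n_maps), applied after global
                  average pooling (no bias in this sketch).
    """
    # Contract the feature-map axis; the result has shape (height, width).
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)

rng = np.random.default_rng(0)
feature_maps = rng.random((8, 7, 7))   # hypothetical: 8 maps of size 7 x 7
fc_weights = rng.normal(size=(3, 8))   # hypothetical: 3 classes

# Class scores: global average pooling followed by the linear layer.
pooled = feature_maps.mean(axis=(1, 2))
scores = fc_weights @ pooled
predicted = int(np.argmax(scores))

# Spatial map showing which locations support the predicted class.
cam = class_activation_map(feature_maps, fc_weights, predicted)
```

Under these assumptions, the spatial mean of `cam` equals the class score, which is why the same weights that classify the image can also be read as a location-wise contribution map.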

2.3 Hierarchical Attention Network

Yang et al. (2016) proposed a hierarchical RNN architecture, inspired by the fact that a document consists of sentences and sentences are composed of words. In the study, the authors added attention weights to reflect the importance of each sentence and word. As can be seen in Figure 3, the result of their model is the most similar to what we attempt to do in this study. The main difference between their work and ours is that Yang et al. (2016) employed an RNN as the base model and learned the attention weights separately from the corpus, whereas we employ a CNN as the base model for sentiment classification and do not explicitly train the model to learn the word-level attention scores.

3 Classification and Attention Model based on Class Activation Map: CAM

3.1 Overall Framework

Figure 4 shows the overall framework of the proposed method. After collecting the sentences, low-level embedding is performed by the Word2Vec, GloVe, and FastText methods, and the word vectors in the sentence are concatenated to form the initial input matrix for the CNN. Once the CNN model training is completed, the polarity of a given test sentence is predicted. Then, the weights of the fully connected layer are used to combine the feature maps to produce the attention score for every single word in the sentence.

Figure 4: Framework of proposed method.

3.2 Network Architecture

The architecture of the CNN used in this paper is rooted in the CNN architecture used in Kim (2014). However, since the CNN used in Kim (2014) was originally designed for document classification, we made some modifications to it to facilitate the extraction of essential words or phrases. First, zero-padding is added before the first word and after the last word in the sentence so that each word is included in the receptive field the same number of times during convolution, irrespective of the word’s position in the sentence. Second, we applied average-pooling instead of max-pooling. According to Zhou et al. (2016), average-pooling and max-pooling are essentially similar, but using average-pooling is advantageous in identifying the overall scope of the target. Third, we increased the number of filters compared to the CAMs used in Oquab et al. (2015) and Zhou et al. (2016). As these CAMs are specialized for image processing, the receptive field of convolution is a square (e.g., 3 × 3). However, the receptive field of the proposed CAM is a rectangle (e.g., 3 × word embedding dimension), which integrates a larger amount of information into one scalar value compared to the convolutional filter in image processing. To prevent a possible loss of information due to the larger receptive field, we used a much larger number of convolution filters than was used in Kim (2014). Finally, we used a wider variety of word embedding techniques to form the input matrix of a sentence. Kim (2014) only used Word2Vec for word embedding, but we also consider two more recently developed word embedding techniques: GloVe and FastText.
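The effect of the zero-padding in the first modification can be checked with a short sketch. It assumes a stride of 1 and a padding of (filter height − 1) zero rows on each side, which is one way to satisfy the stated requirement; the function name `window_coverage` is hypothetical.

```python
import numpy as np

def window_coverage(n_words, filter_height):
    """Count how many convolution windows (stride 1) contain each word
    after padding filter_height - 1 zero rows on both sides."""
    pad = filter_height - 1                      # assumed padding per side
    padded_len = n_words + 2 * pad
    n_windows = padded_len - filter_height + 1
    counts = np.zeros(padded_len, dtype=int)
    for start in range(n_windows):
        counts[start:start + filter_height] += 1
    # Keep only the positions that hold real words, not padding.
    return counts[pad:pad + n_words], n_windows

counts, n_windows = window_coverage(n_words=6, filter_height=3)
```

With this padding, the first and last words of the sentence are covered by as many windows as the words in the middle, so no position is favored when feature maps are later averaged.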

3.3 Classification and Attention Model based on Class Activation Map

The input of the CNN is created by concatenating the word vectors in a sentence with zero-padding. We used four types of inputs: CNN-rand, CNN-static, CNN-non-static, and CNN-Multichannel. CNN-rand uses randomly initialized word vectors, while CNN-static and CNN-non-static use word vectors pre-trained by Word2Vec. CNN-Multichannel uses word vectors pre-trained by Word2Vec, GloVe, and FastText. Given the dimension of the word embedding vector, the maximum number of words in a sentence, and the height of the receptive field of the convolution, the input matrix is constructed as follows. Zero-padding is first performed before and after the sentence so that each word is included in the receptive field the same number of times during convolution (as many times as the height of the receptive field).

(1)

Given the window size of the CNN filter, i.e., the height of the filter, each feature map is constructed as follows. Because the CNN filter w spans the full word embedding dimension and zero-padding is performed in the previous step, each feature map is a vector with one element per position of the filter along the padded sentence.

(2)
(3)
(4)

A scalar value is computed for each feature map by applying average pooling to it. The final feature vector passed to the fully connected layer is constructed from these scalar values as follows. Considering that n feature maps are computed for a given sentence, this feature vector is n-dimensional.

Figure 5: An example of computing a score vector.
(5)

where n is (the number of filter types) × (the number of filters for each type). The output of the fully connected layer for each sentence is computed as follows:

(6)
(7)
(8)

Once the CNN model is trained, the sentiment importance score of each word is computed as follows. An illustrated example of the following process is provided in Figure 5. Consider the feature maps corresponding to a given filter type and the row vector of the fully connected layer weights for that filter type and a given class. Then, the score vector is computed as

(9)
(10)
(11)

Each element of the score vector for a given filter type and class is computed by averaging elements with a step size of 1, which gives the score vector the same dimension regardless of the height of the filters:

(12)

The final sentiment score of the words in the sentence for a given class is computed by

(13)
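The scoring procedure of this section can be sketched in NumPy as follows. This is one possible reading under stated assumptions, not the authors' implementation: padding of (filter height − 1) zero rows on each side, a ReLU activation, no bias in the fully connected layer, and per-word scores obtained by averaging the window scores that contain the word and then summing over filter types. All function names, toy sizes, and random values are hypothetical; only the filter heights 3, 4, and 5 come from the text.

```python
import numpy as np

def conv_feature_maps(X, filters, bias):
    """X: (l, k) sentence matrix; filters: (m, h, k); returns (m, l + h - 1)."""
    m, h, k = filters.shape
    padded = np.pad(X, ((h - 1, h - 1), (0, 0)))      # zero rows before and after
    n_windows = padded.shape[0] - h + 1
    windows = np.stack([padded[i:i + h] for i in range(n_windows)])
    maps = np.einsum("whk,mhk->mw", windows, filters) + bias[:, None]
    return np.maximum(maps, 0.0)                      # ReLU is an assumption

def word_scores(X, filter_bank, fc_weights, class_idx):
    """Per-word scores for one class, summed over all filter types."""
    l = X.shape[0]
    total = np.zeros(l)
    offset = 0
    for filters, bias in filter_bank:
        m, h, _ = filters.shape
        maps = conv_feature_maps(X, filters, bias)    # (m, l + h - 1)
        w = fc_weights[class_idx, offset:offset + m]  # FC weights for this type
        window_scores = w @ maps                      # (l + h - 1,)
        # Word i is covered by windows i .. i + h - 1; average them (step 1).
        total += np.array([window_scores[i:i + h].mean() for i in range(l)])
        offset += m
    return total

rng = np.random.default_rng(0)
l, k, m, n_classes = 6, 4, 2, 2                       # toy sizes, all hypothetical
X = rng.normal(size=(l, k))
filter_bank = [(rng.normal(size=(m, h, k)), np.zeros(m)) for h in (3, 4, 5)]
fc_weights = rng.normal(size=(n_classes, m * len(filter_bank)))

scores = word_scores(X, filter_bank, fc_weights, class_idx=1)
```

Averaging over the windows that contain each word is what lets filters of different heights produce score vectors of the same length, one entry per word, so they can be added together.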
Score   Reviews   Class
1       10,122    Negative
2       4,586     Negative
3       4,961     Negative
4       5,531     Negative
7       4,803     Positive
8       5,859     Positive
9       4,607     Positive
10      9,731     Positive
Table 1: Rating distributions of the IMDB dataset

Score   Reviews   Class
0.5     50,660    Negative
1       66,184    Negative
1.5     62,094    Negative
2       163,272   Negative
2.5     173,650   Not used
3       411,757   Not used
3.5     424,378   Not used
4       652,250   Not used
4.5     297,327   Not used
5       416,096   Positive
Table 2: Rating distributions of the WATCHA dataset

Dataset   Number of tokens
IMDB      115,205
WATCHA    424,027
Table 3: The number of tokens

Hyper-parameter             Value
Filter type (window size)   3 (tri-gram), 4 (quad-gram), 5 (5-gram)
Number of filters           128 each
Document length             100 words
Dropout rate                0.5
Regularization parameter    0.1
Batch size                  64
Table 4: The hyper-parameters of the CNN

3.4 Word Embedding

We employed four different word embedding methods to construct the input matrix X: random vectors, Word2Vec, GloVe, and FastText. With the random vectors, the elements of the word vectors were randomly initialized, and they were updated during the CNN training. For the latter three methods, word embedding vectors were separately trained using the same corpus for sentiment classification. We also compared the static word embedding and non-static word embedding methods for CAM according to whether the word embedding vectors are updated during the CNN training (non-static) or not (static). In addition, two multi-channel input matrices were also considered. In summary, we tested the following five input matrices for CAM.

  1. CNN-Rand: word vectors are randomly initialized and they are updated during the CNN training.

  2. CNN-Static: word vectors are trained by Word2Vec. They are not updated during the CNN training.

  3. CNN-Non-Static: word vectors are trained by Word2Vec first, and they are updated during the CNN training.

  4. CNN-2ch: CNN-Static and CNN-Non-Static are combined. The input of the CNN becomes a 3-dimensional (I × k × 2) tensor.

  5. CNN-4ch: Three matrices with word vectors trained by Word2Vec, GloVe, and FastText are used. They are updated during the CNN training. The CNN-Non-Static method is used as the fourth matrix. The input of the CNN becomes a 3-dimensional (I × k × 4) tensor.
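The multi-channel inputs in the list above can be pictured as sentence matrices stacked along a third axis. The sketch below uses random placeholder matrices in place of trained embeddings; the variable names are hypothetical, and the sizes of 100 rows and 100 embedding dimensions are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, k = 100, 100   # illustrative: 100 rows, 100-dimensional embeddings

# Placeholder sentence matrices; real ones would hold trained word vectors.
word2vec_static = rng.normal(size=(n_rows, k))
word2vec_non_static = word2vec_static.copy()   # same start, updated in training
glove = rng.normal(size=(n_rows, k))
fasttext = rng.normal(size=(n_rows, k))

# CNN-2ch: static and non-static Word2Vec matrices as two channels.
two_channel = np.stack([word2vec_static, word2vec_non_static], axis=-1)

# CNN-4ch: one possible arrangement of four channels.
four_channel = np.stack(
    [word2vec_static, glove, fasttext, word2vec_non_static], axis=-1
)
```

Stacking on the last axis keeps the row (word) and column (embedding) layout of each matrix intact, so the same convolution filter can be applied across all channels.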

4 Experimental Settings

4.1 Data Sets & Target Labeling

To verify the proposed CAM, we used two sets of movie reviews, one written in English and the other written in Korean. Not only do movie reviews have explicit sentiment labels (ratings or stars), but they generally also have more subjective expressions compared to other formal texts such as news articles. For the English movie review dataset, we used the publicly available IMDB dataset (Maas et al., 2011), while Korean movie reviews were collected directly from the WATCHA website (https://watcha.net), which is the largest movie recommendation service in Korea. Each dataset consists of review sentences and ratings. The distributions of ratings for the IMDB and WATCHA datasets are shown in Tables 1 and 2.

As shown in Table 1, the ratings are well-balanced in the IMDB dataset. Hence, we used the reviews with ratings smaller than or equal to 4 as negative examples, whereas the reviews with ratings greater than or equal to 7 were used as positive examples. Unlike those of the IMDB dataset, the ratings of the WATCHA dataset are highly skewed toward the positive scores. Therefore, we used the reviews with ratings smaller than or equal to 2 as negative examples, whereas only the reviews with 5-point ratings were used as positive examples. In both datasets, 70% of the reviews were used as training data, and the remaining 30% were used as test data.
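The labeling rules and the 70/30 split described above can be written down as a minimal sketch. The function names are hypothetical, and the random shuffle with a fixed seed is an assumption, since the text does not say how the reviews were assigned to the training and test sets.

```python
import random

def label_imdb(rating):
    """IMDB: ratings <= 4 are negative, ratings >= 7 are positive."""
    if rating <= 4:
        return "negative"
    if rating >= 7:
        return "positive"
    return None  # ratings in between are not used

def label_watcha(rating):
    """WATCHA: ratings <= 2 are negative, only 5-point ratings are positive."""
    if rating <= 2:
        return "negative"
    if rating == 5:
        return "positive"
    return None  # ratings from 2.5 to 4.5 are not used

def split_train_test(items, train_ratio=0.7, seed=0):
    """Shuffle (an assumption) and split into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, test = split_train_test(range(10))
```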

4.2 Word Embedding, CNN Parameters, and Performance Measure

Model            IMDB     WATCHA
CNN-Rand         0.8435   0.7793
CNN-Static       0.7750   0.7150
CNN-Non-Static   0.8257   0.7538
CNN-2channel     0.8300   0.7602
CNN-4channel     0.8729   0.7533
Table 5: Test accuracy of each method

Word           Score
this           0.0145
film           0.0291
is             0.1324
actually       0.2183
quite          0.2561
entertaining   0.3496
Table 6: CAM example
Positive (first five columns) Negative (last five columns)
CNN-Rand CNN-Static CNN-Non-Static CNN-2channel CNN-4channel CNN-Rand CNN-Static CNN-Non-Static CNN-2channel CNN-4channel
the and and and and the the the the the
and great is is is a is and and and
a is the the a and was worst worst of
of a a a the of and of of a
is very of of of to bad a a worst
to the s s s is a is is is
I well excellent excellent it I this the was was
in film it great excellent in of awful awful I
this of great it to this plot boring to to
it it to in I that just was I awful
that I in to great was acting to boring this
was as an I it it movie I movie movie
as excellent I an in movie I bad this boring
movie wonderful was perfect perfect for awful poor poor bad
with movie perfect with very with script this bad s
for story as as was as boring movie waste in
film in best very fun have to waste terrible waste
but favorite very best by on that terrible in poor
on beautiful enjoyed enjoyed enjoyed film so in s terrible
an my with wonderful an but t s with for
have good wonderful fun as not terrible horrible are with
are comedy fun by with be it are as as
one loved by movie best are stupid with by are
his also amazing amazing wonderful you horrible for acting it
you most movie loved loved an in by for that
not s most that amazing at are film horrible film
be best loved superb most his film as film by
who enjoyed superb film are one no it that horrible
by love that most for from worst poorly it so
Table 7: Frequently appearing words in the positive/negative sentences in the IMDB test dataset (semantically positive or negative words are colored in blue and red, respectively)

Each sentence was split into tokens on whitespace. Punctuation and numbers were removed. All tokens were used to learn the word embedding vectors. We fixed the dimension of the word embedding to 100 and set the window size of Word2Vec and FastText to 3. For Word2Vec and FastText, we used the skip-gram structure, while unigrams were used to create the co-occurrence matrix for GloVe. The total number of tokens for each dataset is shown in Table 3.
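The preprocessing described above might look like the following sketch. Replacing every non-letter character with a space before splitting is an assumption; it is chosen because it reproduces tokens such as "couldn" and "t" seen in the example tables. The function name `tokenize` is hypothetical.

```python
def tokenize(text):
    """Drop punctuation and digits, then split on whitespace."""
    # Keep letters (including Hangul) and whitespace; blank out the rest.
    cleaned = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)
    return cleaned.split()

english_tokens = tokenize("I couldn't resist. 10 out of 10!")
korean_tokens = tokenize("진짜 감탄스럽다.")
```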

The hyper-parameters for training the CNN are summarized in Table 4. We used three different window sizes (the number of words considered in one receptive field), while the number of filters was fixed at 128 for each window size. The document length, i.e., the maximum number of words, was set to 100. For sentences shorter than 100 words, zero-padding was added after the last word, whereas the trailing words were trimmed if sentences were longer than 100 words. We also used two regularization methods. Dropout is an implicit regularization that ignores some weights in each step (dropout rate = 0.5 in this study), whereas the other is an explicit regularization that adds a norm of the total weight to the loss function.
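The fixed document length can be enforced with a small helper like the one below. The name `pad_or_trim` and the `<pad>` placeholder token, which would be mapped to a zero vector, are hypothetical.

```python
def pad_or_trim(tokens, max_len=100, pad_token="<pad>"):
    """Trim trailing words beyond max_len, or pad after the last word."""
    trimmed = tokens[:max_len]
    return trimmed + [pad_token] * (max_len - len(trimmed))

short = pad_or_trim(["this", "film", "is", "actually", "quite", "entertaining"])
long = pad_or_trim([f"w{i}" for i in range(150)])
```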

Positive (first five columns) Negative (last five columns)
CNN-Rand CNN-Static CNN-Non-Static CNN-2channel CNN-4channel CNN-Rand CNN-Static CNN-Non-Static CNN-2channel CNN-4channel
영화 영화 영화 영화 영화 영화 영화 영화 영화 영화
너무 최고의 (best) 최고의 (best) 너무 너무 너무 너무 너무 너무
최고의 (best) 너무 너무 최고의 (best) 그냥 그냥
정말 그리고 없고 (none) 그냥
다시 그리고 없는 (none) 없는 (none)
영화를 정말 그리고 그냥 없는 (none)
없는 (none) 가장 정말 없다 (none) 없고 (none)
그냥 그리고 가장 정말 그냥 없는 (none) 없고 (none)
없다 (none) 가장 영화를 영화는 영화는 그나마
진짜 최고 (best) 없다 (none) 영화를
너무 있는 최고 (best) 없는 (none) 이런 정말
이런 가장 진짜 진짜 다시 느낌 없고 (none) 없다 (none) 정말
다시 최고 (best) 뻔한 (obvious) 영화를 영화를
있는 진짜 내가 영화를 그나마
영화가 좋다 (good) 이런 보는 영화는
영화는 아름다운 (beautiful) 모든 영화가 보고 영화가 (not) 정말 영화는 없다 (none)
진짜 함께 영화가 보고 좋다 (good) 정말 없다 (none) 이런
보고 있는 봐도 영화는 그나마
내가 이렇게 무슨 보는 이런
정말 최고 (best) 다시 마지막 있는 이런 대한 별로 (not much of)
작품 내가 보는 것도 보는 영화가
모든 봐도 모든 모든 스토리 차라리 (rather)
내가 내가 완벽한 (perfect) 진짜 많이 대한 차라리 (rather)
이렇게 내내 좋다 (good) 이렇게 차라리 (rather) 차라리 (rather)
봐도 완벽한 (perfect) 내가 없고 (none) 아깝다 (wasted)
보는 좋다 (good) 봐도 하는 아닌 (not) 느낌
이건 있을까 마지막 영화를 뻔한 (obvious) 대한
좋은 (good) 대한 스토리 내가 봤는데 최악의 (worst)
보고 모두 완벽한 (perfect) 이건 못한 (not) 영화가 최악의 (worst)
Table 8: Frequently appearing words in the positive/negative sentences in the WATCHA test dataset (semantically positive or negative words are in blue and red fonts, respectively)
Methodology Sentence
Raw text I’m normally not a Drama/Feel good movie kind of guy but once I saw the trailer for Radio I couldn’t resist. Not only is this a great film but it also has great acting. Cuba Gooding Jr. did an excellent job portraying James Robert Kennedy a.k.a. RAdio. Ed Harris also did a fantastic job as Coach Jones. I was pleasantly surprised to see some comedy in it as well. So for a great story great acting and a little comedy I give Radio a 10 out of 10! (10 / 10 points)
CNN-Rand I m normally not a Drama Feel good movie kind of guy but once I saw the trailer for Radio I couldn t resist Not only is this a great film but it also has great acting Cuba Gooding Jr did an excellent job portraying James Robert Kennedy a k a RAdio Ed Harris also did a fantastic job as Coach Jones I was pleasantly surprised to see some comedy in it as well So for a great story great acting and a little comedy I give Radio a out of Positive
CNN-Static I m normally not a Drama Feel good movie kind of guy but once I saw the trailer for Radio I couldn t resist Not only is this a great film but it also has great acting Cuba Gooding Jr did an excellent job portraying James Robert Kennedy a k a RAdio Ed Harris also did a fantastic job as Coach Jones I was pleasantly surprised to see some comedy in it as well So for a great story great acting and a little comedy I give Radio a out of Positive
CNN-Non-Static I m normally not a Drama Feel good movie kind of guy but once I saw the trailer for Radio I couldn t resist Not only is this a great film but it also has great acting Cuba Gooding Jr did an excellent job portraying James Robert Kennedy a k a RAdio Ed Harris also did a fantastic job as Coach Jones I was pleasantly surprised to see some comedy in it as well So for a great story great acting and a little comedy I give Radio a out of Positive
CNN-2channel I m normally not a Drama Feel good movie kind of guy but once I saw the trailer for Radio I couldn t resist Not only is this a great film but it also has great acting Cuba Gooding Jr did an excellent job portraying James Robert Kennedy a k a RAdio Ed Harris also did a fantastic job as Coach Jones I was pleasantly surprised to see some comedy in it as well So for a great story great acting and a little comedy I give Radio a out of Positive
CNN-4channel I m normally not a Drama Feel good movie kind of guy but once I saw the trailer for Radio I couldn t resist Not only is this a great film but it also has great acting Cuba Gooding Jr did an excellent job portraying James Robert Kennedy a k a RAdio Ed Harris also did a fantastic job as Coach Jones I was pleasantly surprised to see some comedy in it as well So for a great story great acting and a little comedy I give Radio a out of Positive
Table 9: Example of word attention for a positively classified sentence in the IMDB dataset

5 Result

5.1 Classification Performance

Table 5 shows the classification accuracies for the five CNN models. It is worth noting that CNN-Static resulted in the lowest classification accuracy for both the IMDB and WATCHA datasets. Since CNN-Static is the only model that does not update the word embedding vectors during the CNN training, this result suggests that updating the word embedding vectors for a given corpus during model training, whether or not the word vectors were independently trained beforehand, helps achieve better classification performance.

Table 6 shows an example of CAM for a test sentence. The overall sentiment of this sentence is classified as positive. The higher the score of a word, the more the CNN model considers it to contribute to the overall sentiment. Thus, the word ’entertaining’ had the greatest impact on the classification of this review as being positive.
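As a small worked example, the scores reported in Table 6 can be ranked directly. The values below are copied from that table; the variable names are hypothetical.

```python
# Word-level scores for the example sentence, as reported in Table 6.
cam_scores = {
    "this": 0.0145,
    "film": 0.0291,
    "is": 0.1324,
    "actually": 0.2183,
    "quite": 0.2561,
    "entertaining": 0.3496,
}

# Rank words from the highest to the lowest contribution.
ranked = sorted(cam_scores, key=cam_scores.get, reverse=True)
top_word = ranked[0]
```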

5.2 Finding Sentimental Words

Methodology Sentence
Raw text This is one of the most boring films I’ve ever seen. The three main cast members just didn’t seem to click well. Giovanni Ribisi’s character was quite annoying. For some reason he seems to like repeating what he says. If he was the Rain Man it would’ve been fine but he’s not. (3/10 points)
CNN-Rand This is one of the most boring films I’ve ever seen The three main cast members just didn t seem to click well Giovanni Ribisi s character was quite annoying For some reason he seems to like repeating what he says If he was the Rain Man it would ve been fine but he s not Negative
CNN-Static This is one of the most boring films I ve ever seen The three main cast members just didn t seem to click well Giovanni Ribisi s character was quite annoying For some reason he seems to like repeating what he says If he was the Rain Man it would ve been fine but he’s not Negative
CNN-Non-Static This is one of the most boring films I’ve ever seen The three main cast members just didn t seem to click well Giovanni Ribisi s character was quite annoying For some reason he seems to like repeating what he says If he was the Rain Man it would ve been fine but he’s not Negative
CNN-2channel This is one of the most boring films I ve ever seen The three main cast members just didn t seem to click well Giovanni Ribisi s character was quite annoying For some reason he seems to like repeating what he says If he was the Rain Man it would ve been fine but he s not Negative
CNN-4channel This is one of the most boring films I ve ever seen The three main cast members just didn t seem to click well Giovanni Ribisi s character was quite annoying For some reason he seems to like repeating what he says If he was the Rain Man it would ve been fine but he s not Negative
Table 10: Example of word attention for a negatively classified sentence in the IMDB dataset
Methodology Sentence
Raw text This movie has a lot to recommend it. The paintings the music and David Hewlett’s naked butt are all gorgeous! The plot a story of redemption forgiveness and courage in the face of adversity is also very interesting and touching – and it’s not predictable which is saying quite a lot about a movie in this day and age. But the acting is mediocre the direction is confusing and the script is just odd. It often felt like it was trying to be a parody but I never figured out what it was trying to be parody *of*. (9 / 10 points)
CNN-Rand This movie has a lot to recommend it The paintings the music and David Hewlett s naked butt are all gorgeous The plot a story of redemption forgiveness and courage in the face of adversity is also very interesting and touching and it s not predictable which is saying quite a lot about a movie in this day and age But the acting is mediocre the direction is confusing and the script is just odd It often felt like it was trying to be a parody but I never figured out what it was trying to be parody of Negative
CNN-Static This movie has a lot to recommend it The paintings the music and David Hewlett s naked butt are all gorgeous The plot a story of redemption forgiveness and courage in the face of adversity is also very interesting and touching and it s not predictable which is saying quite a lot about a movie in this day and age But the acting is mediocre the direction is confusing and the script is just odd It often felt like it was trying to be a parody but I never figured out what it was trying to be parody of Negative
CNN-Non-Static This movie has a lot to recommend it The paintings the music and David Hewlett s naked butt are all gorgeous The plot a story of redemption forgiveness and courage in the face of adversity is also very interesting and touching and it s not predictable which is saying quite a lot about a movie in this day and age But the acting is mediocre the direction is confusing and the script is just odd It often felt like it was trying to be a parody but I never figured out what it was trying to be parody of Positive
CNN-2channel This movie has a lot to recommend it The paintings the music and David Hewlett s naked butt are all gorgeous The plot a story of redemption forgiveness and courage in the face of adversity is also very interesting and touching and it s not predictable which is saying quite a lot about a movie in this day and age But the acting is mediocre the direction is confusing and the script is just odd It often felt like it was trying to be a parody but I never figured out what it was trying to be parody of Positive
CNN-4channel This movie has a lot to recommend it The paintings the music and David Hewlett s naked butt are all gorgeous The plot a story of redemption forgiveness and courage in the face of adversity is also very interesting and touching and it s not predictable which is saying quite a lot about a movie in this day and age But the acting is mediocre the direction is confusing and the script is just odd It often felt like it was trying to be a parody but I never figured out what it was trying to be parody of Positive
Table 11: Example of word attention for a sentence in the IMDB dataset whose predicted class is different according to CNN models
Methodology Sentence
Raw text 살라딘의 기사도 정신이 진짜 감탄스럽다. 예수상을 다시 세우고 십자가 바닥을 안 밟고 지나가는 장면이 존경스럽다. (5 / 5 points) (Saladin’s spirit of chivalry is truly amazing. I’m very impressed by the scene of setting the statue of Jesus back up and passing without stepping on the floor of the cross.)
CNN-Rand 살라딘의 기사도 정신이 진짜 감탄스럽다 예수상을 다시 세우고 십자가 바닥을 안 밟고 지나가는 장면이 존경스럽다 Positive
CNN-Static 살라딘의 기사도 정신이 진짜 감탄스럽다 예수상을 다시 세우고 십자가 바닥을 안 밟고 지나가는 장면이 존경스럽다 Positive
CNN-Non-Static 살라딘의 기사도 정신이 진짜 감탄스럽다 예수상을 다시 세우고 십자가 바닥을 안 밟고 지나가는 장면이 존경스럽다 Positive
CNN-2channel 살라딘의 기사도 정신이 진짜 감탄스럽다 예수상을 다시 세우고 십자가 바닥을 안 밟고 지나가는 장면이 존경스럽다 Positive
CNN-4channel 살라딘의 기사도 정신이 진짜 감탄스럽다 예수상을 다시 세우고 십자가 바닥을 안 밟고 지나가는 장면이 존경스럽다 Positive
Table 12: Example of word attention for a positively classified sentence in the WATCHA dataset
Methodology Sentence
Raw text 영화 전체를 통틀어 가장 불필요하고 의미없는 가오를 잡는 여자가 환호를 받고 있는 아이러니한 영화! 사운드트랙은 인정하더라도 관객을 지나가는 메트로폴리스 행인만도 못하게 다루는 스토리텔링 한마디로 총체적 난국. (2 / 5 points) (An ironic movie in which the most unnecessary and meaningless flaunt woman in the whole movie is being cheered! Soundtracks are acceptable but storytelling makes the audience run down. A total impasse in a word.)
CNN-Rand 영화 전체를 통틀어 가장 불필요하고 의미없는 가오를 잡는 여자가 환호를 받고 있는 아이러니한 영화 사운드트랙은 인정하더라도 관객을 지나가는 메트로폴리스 행인만도 못하게 다루는 스토리텔링 한마디로 총체적 난국 Negative
CNN-Static 영화 전체를 통틀어 가장 불필요하고 의미없는 가오를 잡는 여자가 환호를 받고 있는 아이러니한 영화 사운드트랙은 인정하더라도 관객을 지나가는 메트로폴리스 행인만도 못하게 다루는 스토리텔링 한마디로 총체적 난국 Negative
CNN-Non-Static 영화 전체를 통틀어 가장 불필요하고 의미없는 가오를 잡는 여자가 환호를 받고 있는 아이러니한 영화 사운드트랙은 인정하더라도 관객을 지나가는 메트로폴리스 행인만도 못하게 다루는 스토리텔링 한마디로 총체적 난국 Negative
CNN-2channel 영화 전체를 통틀어 가장 불필요하고 의미없는 가오를 잡는 여자가 환호를 받고 있는 아이러니한 영화 사운드트랙은 인정하더라도 관객을 지나가는 메트로폴리스 행인만도 못하게 다루는 스토리텔링 한마디로 총체적 난국 Negative
CNN-4channel 영화 전체를 통틀어 가장 불필요하고 의미없는 가오를 잡는 여자가 환호를 받고 있는 아이러니한 영화 사운드트랙은 인정하더라도 관객을 지나가는 메트로폴리스 행인만도 못하게 다루는 스토리텔링 한마디로 총체적 난국 Negative
Table 13: Example of word attention for a negatively classified sentence in the WATCHA dataset
Methodology Sentence
Raw text 이렇게 재미없고 그래픽도 꾸지고 난장판인 엑스맨을 과거의 이야기로 새로 시작한 메튜 본 감독과 깔끔하게 다시 재정리한 브라이언 싱어 감독에게 박수를… ( 1 / 5 points) (I would like to pay tribute to Bryan Singer, who just reconstituted this boring and messy X-Men as a story of the past, and Matthew Vaughn, who neatly rearranged it again.)
CNN-Rand 이렇게 재미없고 그래픽도 꾸지고 난장판인 엑스맨을 과거의 이야기로 새로 시작한 메튜 본 감독과 깔끔하게 다시 재정리한 브라이언 싱어 감독에게 박수를 Negative
CNN-Static 이렇게 재미없고 그래픽도 꾸지고 난장판인 엑스맨을 과거의 이야기로 새로 시작한 메튜 본 감독과 깔끔하게 다시 재정리한 브라이언 싱어 감독에게 박수를 Positive
CNN-Non-Static 이렇게 재미없고 그래픽도 꾸지고 난장판인 엑스맨을 과거의 이야기로 새로 시작한 메튜 본 감독과 깔끔하게 다시 재정리한 브라이언 싱어 감독에게 박수를 Negative
CNN-2channel 이렇게 재미없고 그래픽도 꾸지고 난장판인 엑스맨을 과거의 이야기로 새로 시작한 메튜 본 감독과 깔끔하게 다시 재정리한 브라이언 싱어 감독에게 박수를 Negative
CNN-4channel 이렇게 재미없고 그래픽도 꾸지고 난장판인 엑스맨을 과거의 이야기로 새로 시작한 메튜 본 감독과 깔끔하게 다시 재정리한 브라이언 싱어 감독에게 박수를 Negative
Table 14: Example of word attention for a sentence in the WATCHA dataset whose predicted class is different according to CNN models

Table 7 lists the frequent words in the IMDB test dataset, obtained by selecting the top five highly scored words in the sentences classified as positive (left five columns) and negative (right five columns). It is worth noting that although the CNN-Rand yielded a relatively good classification performance compared to the other techniques, it identified the fewest emotional words among the five CNN models. Although the classification performance of CNN-Static was the worst, its attention mechanism seemed to work well, in that many emotional words were highly ranked. In terms of classification performance, what matters is whether or not the input vectors are updated during training. For word attention in sentiment classification, however, it becomes more important whether the general grammatical relationships between words are well preserved in the word embedding vectors (i.e., not updated for the classification task).
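
The word lists in Tables 7 and 8 can be thought of as the outcome of a simple counting procedure. The following is a minimal sketch of one way to do it, not the authors' code: the function names, the toy scores, and the choice to count a word once per top-five slot it occupies are all illustrative assumptions.

```python
from collections import Counter


def top_k_words(words, scores, k=5):
    """Return the k words with the highest attention scores in one sentence."""
    ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]


def frequent_attention_words(sentences, k=5, n_most_common=10):
    """Count how often each word lands in a sentence's top-k list.

    `sentences` is a list of (words, scores) pairs, one per sentence that
    the classifier assigned to the class of interest (assumed input format).
    """
    counts = Counter()
    for words, scores in sentences:
        counts.update(top_k_words(words, scores, k))
    return counts.most_common(n_most_common)


# Toy, made-up scores for two sentences classified as positive.
positive_sentences = [
    (["a", "great", "movie"], [0.1, 0.9, 0.3]),
    (["great", "acting"], [0.8, 0.2]),
]
top_positive = frequent_attention_words(positive_sentences, k=1)
```

Running the same procedure separately on the sentences classified as positive and as negative would give the two halves of such a table.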

Table 8 lists the frequent words in the WATCHA test dataset, obtained by selecting the top five highly scored words in the sentences classified as positive (left five columns) and negative (right five columns). In this case, the emotional words in the top-ranked word lists overlap more across methods than they do for the IMDB dataset. This is because Korean is an agglutinative language, which tends to have a high rate of affixes per word. For example, “없다, 없는, 없고…(none),” “안, 아닌, 못…(not),” and “차라리(rather)” are usually used in Korean for negative expressions. Experimental results confirm that these words are more frequently used in the negative reviews than in the positive reviews (except for CNN-Rand).

5.3 Word Attention: IMDB

Table 9 shows an example of word attention for a positively classified sentence in the IMDB dataset. The words highlighted in blue are the top 10% highly scored words in the sentence. The four models other than the CNN-Rand successfully capture semantically positive words or phrases (e.g., excellent, fantastic, and was pleasantly surprised). The CNN-Static is especially good at paying attention to longer sentimental phrases such as “a great story great acting.”

Table 10 shows an example of word attention for a negatively classified sentence in the IMDB dataset. The words highlighted in red are the top 10% highly scored words in the sentence. A reader can easily recognize multiple negative expressions within the review, which is why different models attend to different words or phrases. For example, the CNN-Non-Static, CNN-2channel, and CNN-4channel pay attention to “boring” and “annoying,” both of which are clearly negative expressions when used in a movie review. However, there is another explicit negative expression, namely, “it would (have) been fine,” which receives attention from the CNN-Rand.

Table 11 shows an example of attention results for a sentence whose predicted class differs across the CNN models because of mixed emotional expressions within the sentence. In this case, if the sentence is classified as positive, the words with the top 10% of scores are highlighted in blue and those with the bottom 10% of scores are highlighted in red. The highlighting scheme is reversed if the sentence is classified as negative. As in the previous examples, the CNN-Static, CNN-Non-Static, CNN-2channel, and CNN-4channel have relatively better attention performances than the CNN-Rand. Again, the CNN-Static performs relatively well in capturing longer emotional phrases such as “is also very interesting and touching.”
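
The highlighting rule used for Tables 11 and 14 can be sketched in a few lines. This is an illustrative reading of the rule rather than the authors' implementation: the function name, the rounding (round up, at least one word per group), and the toy scores are assumptions.

```python
import math


def highlight_words(words, scores, predicted_positive, fraction=0.10):
    """Assign a highlight color to the highest- and lowest-scored words.

    Positive prediction: top fraction -> "blue", bottom fraction -> "red".
    Negative prediction: the scheme is reversed.
    Words in neither group get None.
    """
    # Assumption: round up and always mark at least one word per group.
    n = max(1, math.ceil(len(words) * fraction))
    order = sorted(range(len(words)), key=lambda i: scores[i])
    lowest, highest = order[:n], order[-n:]

    if predicted_positive:
        top_color, bottom_color = "blue", "red"
    else:
        top_color, bottom_color = "red", "blue"

    colors = {}
    for i in lowest:
        colors[i] = bottom_color
    for i in highest:  # the top group wins if the two groups overlap
        colors[i] = top_color
    return [(word, colors.get(i)) for i, word in enumerate(words)]


# Toy, made-up scores for a ten-word sentence with mixed sentiment.
words = "the plot is touching but the acting is mediocre overall".split()
scores = [0.0, 0.1, 0.0, 0.9, -0.1, 0.0, -0.2, 0.0, -0.8, 0.05]
as_positive = highlight_words(words, scores, predicted_positive=True)
as_negative = highlight_words(words, scores, predicted_positive=False)
```

With ten words and a 10% fraction, exactly one word falls into each group, so the same sentence is colored in opposite ways depending on the predicted class.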

5.4 Word Attention: WATCHA

Table 12 shows an example of word attention for a positively classified sentence in the WATCHA dataset. The words highlighted in blue are the top 10% highly scored words in the sentence. In this sentence, there are two obvious positive expressions, i.e., 감탄스럽다 (impressive) and 존경스럽다 (admirable); the former was successfully detected by CNN-Static, CNN-Non-Static, CNN-2channel, and CNN-4channel, while the latter was detected by CNN-Rand.

Table 13 shows an example of word attention for a negatively classified sentence in the WATCHA dataset. The words highlighted in blue are the top 10% highly scored words in the sentence. This sentence also has two semantically explicit negative expressions: “불필요하고 의미없는 가오 (unnecessary and meaningless flaunt)” and “한마디로 총체적 난국 (a total crisis in a word).” The CNN-Rand focused on the former expression, whereas the other four models focused on the latter expression. Similar to the example of the positive sentence in Table 12, it seems that the attention mechanism of CNN-Rand is somewhat different from those of the other models. This is mainly because the word embedding vectors are not updated to reflect the user’s rating information. Hence, more general emotional expressions, rather than movie-review specific expressions, receive higher attention from the CNN-Rand.

Table 14 shows an example presented in the same manner as the one in Table 11. The three models other than CNN-Rand and CNN-Static focus on the negative phrase “재미없고 (boring)” and the positive phrase “깔끔하게 (neatly)”. Qualitatively, the former is a stronger emotional expression than the latter, which results in the entire sentence being predicted as negative. However, the CNN-Static finds a stronger positive expression, i.e., “박수를 (pay tribute to)” rather than “깔끔하게 (neatly)”, which results in the CNN model predicting the whole sentence as positive.

6 Conclusion

In this paper, we propose a classification and attention model with a class activation map, which is a sentiment classification model with word attention based on weakly supervised CNN learning. Although the proposed model is trained on class labels only, it can not only predict the overall sentiment of a given sentence but also find important emotional words that contribute significantly to the predicted class. Compared to the previous CNN-based text classification model, the proposed model utilizes zero-padding to help the CNN consider every word equally regardless of its position in the sentence. Moreover, it uses average pooling and a large number of filters to preserve as much information as possible. In addition, various word embedding techniques are employed and integrated.
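
To make the link between average pooling and word attention concrete, the following minimal sketch uses the standard class activation map formulation, in which the fully connected weights for a class are applied to each word position of the convolutional feature maps. It is a simplified illustration under stated assumptions (a single filter width, one feature row per word thanks to zero-padding, no bias term, toy numbers), not the authors' exact scoring function; all names are hypothetical.

```python
def class_activation_scores(feature_maps, class_weights):
    """Per-word score for one class.

    feature_maps: one row per word, one column per filter (assumed layout).
    class_weights: fully connected weights for the class, one per filter.
    """
    return [
        sum(w * a for w, a in zip(class_weights, word_activations))
        for word_activations in feature_maps
    ]


def average_pool(feature_maps):
    """Average each filter's activations over all word positions."""
    n_words = len(feature_maps)
    return [sum(column) / n_words for column in zip(*feature_maps)]


def class_logit(feature_maps, class_weights):
    """Class logit (bias omitted) computed from the average-pooled features."""
    pooled = average_pool(feature_maps)
    return sum(w * p for w, p in zip(class_weights, pooled))


# Toy activations: three words, two filters (made-up numbers).
feature_maps = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
class_weights = [0.5, -1.0]

word_scores = class_activation_scores(feature_maps, class_weights)
logit = class_logit(feature_maps, class_weights)
mean_score = sum(word_scores) / len(word_scores)
```

Because average pooling is linear, the mean of the per-word scores equals the class logit in this sketch, which is what allows the classification weights to be read back as word-level contributions.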
Experimental results on two movie review datasets, IMDB, which is in English, and WATCHA, which is in Korean, show that the proposed model yielded classification accuracies higher than 87% for the IMDB dataset and 78% for the WATCHA dataset. The CNN models that update the word embedding vectors during the sentiment classification learning (CNN-Rand, CNN-Non-Static, CNN-2channel, and CNN-4channel) achieved higher classification performance than the model that did not update the word embedding vectors (CNN-Static). It is also worth noting that the integration of multiple word embedding techniques improved the classification performance for the IMDB dataset. However, all models showed the ability to find important emotional words in the sentence, although their internal mechanisms might differ. For the WATCHA dataset, in particular, the CNN-Static, which does not update the word embedding vectors during training, focused more on generally accepted emotional expressions, whereas the other models, which adapt to the language usage patterns of the movie review domain, seemed to focus more on domain-dependent emotional expressions.
We expect that the proposed methodology can be useful in domains where it is important to understand what the input sentences are intended to convey, such as visual question answering systems or chatbots. Although the experimental results were favorable, the current study has some limitations, which lead us to future research directions. First, the proposed method used simple whitespace-based tokenization for training word embedding vectors. If more sophisticated preprocessing techniques, such as lemmatization, are applied, the classification and attention performance may improve. Second, quantitative evaluation of word attention, i.e., how good or appropriate the identified words are in the context of sentiment classification, is difficult, which is why we qualitatively interpreted the word attention results in Section 4. Developing a systematic and quantitative evaluation method for word attention can be another meaningful future research topic.

References