Leveraging Sparse and Dense Feature Combinations for Sentiment Classification

08/13/2017 ∙ Tao Yu, et al. ∙ Columbia University

Neural networks are one of the most popular approaches for many natural language processing tasks such as sentiment analysis. They often outperform traditional machine learning models and achieve state-of-the-art results on most tasks. However, many existing deep learning models are complex, difficult to train, and provide only limited improvement over simpler methods. We propose a simple, robust, and powerful model for sentiment classification. This model outperforms many deep learning models and achieves results comparable to deep learning models with complex architectures on sentiment analysis datasets. We publish the code online.


1 Introduction

Deep learning models have been applied to tasks in natural language processing in recent years and have achieved very good performance. In particular, for many sentiment analysis tasks, recurrent neural networks with many hidden layers are the current state-of-the-art. These models perform well because they can account for the context in which words appear and are able to model long-range dependencies which may modify the sentiment of a statement within a sentence.

However, these models are complex, with many parameters. As a result, training is computationally intensive and may only be accessible to organizations with significant computing resources. Furthermore, because of the many parameters in a neural network, it is easy to overfit the training data without expert knowledge of how to tune them.

On the other hand, one of the best performing linear models is the Naive Bayes Support Vector Machine (NBSVM) [Wang and Manning2012], a bag-of-words (BoW) model using only n-gram features. This model performs well among traditional machine learning methods, but its results are significantly lower than those of many deep learning models. In other work, word embeddings have been shown to be a useful resource for many natural language processing tasks, and some research has shown how to use them in traditional models. However, few models combine word embeddings with bag-of-words features. We provide a simple method for using word embeddings in conjunction with NBSVM that outperforms more sophisticated methods.

We show that the performance of a linear bag-of-words model increases by naively combining n-grams and word embeddings. We obtain additional improvements by grouping word embeddings according to part-of-speech (POS) tags. Finally, we discuss other complex ways of incorporating word embeddings that do not improve performance.

Section 2 will describe previous work in sentiment analysis. Sections 3 and 4 describe the datasets used for experiments and an explanation of the methodology. Then we present results in Section 5 showing near state-of-the-art performance in many tasks and outperforming several more complicated models. Finally, we discuss our findings and future directions for this work in Section 6.

2 Related Work

Sentiment analysis has a rich literature of machine learning models. Within the deep learning framework, recurrent, recursive, and convolutional networks have all been used. The current state of the art on many sentiment tasks is the AdaSent model [Zhao et al.2015]. This model forms a hierarchy of representations from words to phrases and then to sentences through recursive gated local composition of adjacent segments; a gating network then combines the levels of the hierarchy into an ensemble. One advantage of this model is that the binary parse trees for the recursive network are learned adaptively rather than taken from pre-trained parses. Using syntactic trees for recursive neural networks was first shown to perform well on sentiment analysis tasks, given labels at every node in the tree [Socher et al.2013]. Dai and Le [Dai and Le2015] used semi-supervised sequence learning with LSTM recurrent networks.

In one of the first uses of convolutional neural networks for sentiment analysis, researchers used static and non-static channels for fixed and dynamic embeddings (allowing the embeddings to vary during training) [Kim2014]. In other work, researchers run a bi-directional LSTM over a sentence, treat the resulting matrix as an image, and apply 2-dimensional convolutions followed by max pooling (BLSTM-2DPooling) [Zhou et al.2016]. Another recent model, Dependency Sensitive Convolutional Neural Networks (DSCNN), hierarchically builds text representations from root-level labels using LSTMs and CNNs [Zhang et al.2016].

In the area of target-dependent sentiment analysis, initial work used an SVM with dependency parse features [Jiang et al.2011]. Later work used neural models modified to be target-specific [Dong et al.2014, Vo and Zhang2015, Tang et al.2015].

In other work not using neural networks, NBSVM adjusts the binary counts of words in a bag-of-words approach by weighting each word according to the ratio of its counts in positive and negative documents [Wang and Manning2012]. More recently, some experiments have been done to combine bag-of-words vectors with word embeddings, using the most frequent 30,000 uni-grams and bi-grams and concatenating them with averaged word embeddings [Zhang and Wallace2015]. Other research involved learning weights for embeddings [Li et al.2016] using a Naive Bayes approach.
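To make the NB weighting concrete, the log-count ratio described above can be sketched in a few lines of Python. This is a toy illustration with hypothetical documents, not the authors' released code:

```python
import math
from collections import Counter

def log_count_ratio(pos_docs, neg_docs, alpha=1.0):
    """NBSVM-style log-count ratio r_w = log((p_w/||p||_1) / (q_w/||q||_1)).

    pos_docs / neg_docs are lists of token lists; alpha smooths counts.
    Counts are binarized: a word counts at most once per document.
    """
    vocab = {w for d in pos_docs + neg_docs for w in d}
    p = Counter(w for d in pos_docs for w in set(d))
    q = Counter(w for d in neg_docs for w in set(d))
    p_norm = sum(p[w] + alpha for w in vocab)
    q_norm = sum(q[w] + alpha for w in vocab)
    return {w: math.log(((p[w] + alpha) / p_norm) /
                        ((q[w] + alpha) / q_norm)) for w in vocab}

pos = [["great", "movie"], ["great", "acting"]]
neg = [["terrible", "movie"], ["boring", "plot"]]
r = log_count_ratio(pos, neg)
# "great" occurs only in positive documents, so r["great"] > 0;
# "movie" occurs equally often in both, so r["movie"] is 0.
```

In NBSVM these ratios re-weight the binarized count features before the linear classifier is trained.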

3 Data

We test our model on several sentiment analysis datasets: the Movie Review dataset (MR), the Customer Review corpus (CR), the MPQA Opinion corpus (MPQA), the subjectivity dataset (Subj), reviews from the Internet Movie Database (IMDB), and a Twitter sentiment dataset (TTwt). The MR dataset was created from short movie reviews, with one sentence per review [Pang and Lee2005]. Another dataset (CR) was created by crawling customer reviews from the Web for 5 technology products [Hu and Liu2004] and 9 other technology products, combined following [Nakagawa et al.2010]. For MPQA, we focus on the opinion polarity subtask [Wiebe and Cardie2005]. The goal of the subjectivity dataset (Subj) is to distinguish between subjective reviews and objective plot summaries [Pang and Lee2004]. The IMDB corpus consists of full-length reviews [Maas et al.2011] and the Twitter dataset (TTwt) [Jiang et al.2011] contains sentiment dependent on a particular target. Detailed statistics of the datasets are shown in Table 1.

Data  Classes  Length  (#pos, #neg, #neut)  Vocabulary Size  Testing
MR    2        20      (5331, 5331, -)      21000            CV
CR    2        20      (2406, 1366, -)      5713             CV
MPQA  2        3       (3316, 7308, -)      6299             CV
Subj  2        24      (5000, 5000, -)      24000            CV
IMDB  2        231     (25000, 25000, -)    392000           N
TTwt  3        13      (1562, 1562, 3124)   10000            N
Table 1: Dataset statistics. Length: average number of words per example. (#pos, #neg, #neut): number of positive, negative, and neutral examples. Testing: CV for results reported with cross-validation, N for results on held-out test data.

4 Methods

Our model is the Naive Bayes Logistic Regression with word embedding features (NBLR + POSwemb) model, which is an extension of NBSVM. To leverage sparse and dense feature combinations, our NBLR + POSwemb model uses the following features:

  • Sentiment and negation word indicators: For each sentence, we append a positive or negative sentiment indicator token at the end of the sentence if it includes words from the MPQA [Wilson et al.2005] or Liu [Hu and Liu2004] sentiment lexicons. We apply the same step if the sentence contains negation words or adversatives.

  • Log-count ratios for multiclasses: After adding sentiment and negation word indicators to a sentence, we compute log-count ratio vectors for multiple classes. Let $f^{(i)}$ be the count vector for training case $i$ with label $y^{(i)}$. For a class $c$, the smoothed count vector for documents with label $c$ is $\mathbf{p}^{(c)} = \alpha + \sum_{i: y^{(i)} = c} f^{(i)}$ and for documents with other labels is $\mathbf{q}^{(c)} = \alpha + \sum_{i: y^{(i)} \neq c} f^{(i)}$, where $\alpha$ is a smoothing parameter (here we set $\alpha = 1$). The log-count ratio for class $c$ is then:

    $$\mathbf{r}^{(c)} = \log\left(\frac{\mathbf{p}^{(c)} / \lVert \mathbf{p}^{(c)} \rVert_1}{\mathbf{q}^{(c)} / \lVert \mathbf{q}^{(c)} \rVert_1}\right)$$

    We concatenate the log-count ratio vectors of all classes to obtain the final sparse vector $\mathbf{r}$.

  • Averaged and part-of-speech (POS) word embedding features: For a sentence $s$, we first compute the averaged word embedding $\mathbf{v}_{avg}$ of all words. We then POS tag all words in the sentence and group them by tag into the set $T$ = {NOUN, VERB, ADJECTIVE}. For the MR, CR, MPQA, Subj, and IMDB datasets, we use the NLTK part-of-speech tagger (http://www.nltk.org/); for the TTwt dataset, we use the CMU Ark Tweet NLP tagger (https://github.com/brendano/ark-tweet-nlp). The POS word embedding vector $\mathbf{v}_t$ for a tag $t \in T$ is computed by averaging the word vectors of the words tagged as $t$:

    $$\mathbf{v}_t = \frac{\sum_{w \in s} \mathbb{1}[\mathrm{pos}(w) = t]\,\mathbf{e}(w)}{\sum_{w \in s} \mathbb{1}[\mathrm{pos}(w) = t]}$$

    where $\mathbb{1}[\mathrm{pos}(w) = t]$ is the indicator feature and $\mathbf{e}(w)$ is the pre-trained word vector for the word $w$. The final dense vector is the concatenation of $\mathbf{v}_{avg}$ and the vectors $\mathbf{v}_t$ for all $t \in T$. (For the target-dependent Twitter sentiment analysis task (TTwt), we also add an additional averaged word embedding vector for the target.)

Finally, we combine the sparse and dense feature vectors to generate the final input vector for each sentence, which is fed to a logistic regression classifier.
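A minimal sketch of building the combined feature vector follows. Toy embeddings and a hard-coded POS lookup stand in for the Google word2vec vectors and the NLTK tagger; all names and values here are illustrative, not taken from the paper's released code:

```python
import numpy as np

# Hypothetical 2-dimensional embeddings and POS tags for three words.
EMB = {"great": np.array([0.9, 0.1]),
       "movie": np.array([0.2, 0.8]),
       "runs":  np.array([0.5, 0.5])}
POS = {"great": "ADJECTIVE", "movie": "NOUN", "runs": "VERB"}
GROUPS = ("NOUN", "VERB", "ADJECTIVE")
DIM = 2

def dense_features(tokens):
    """Averaged embedding of all words, plus one averaged vector per POS group."""
    vecs = [EMB[w] for w in tokens if w in EMB]
    parts = [np.mean(vecs, axis=0) if vecs else np.zeros(DIM)]
    for g in GROUPS:
        gv = [EMB[w] for w in tokens if w in EMB and POS.get(w) == g]
        parts.append(np.mean(gv, axis=0) if gv else np.zeros(DIM))
    return np.concatenate(parts)

def features(tokens, r):
    """Concatenate the NB log-count-ratio sparse vector with the dense vector."""
    vocab = sorted(r)
    sparse = np.array([r[w] if w in tokens else 0.0 for w in vocab])
    return np.concatenate([sparse, dense_features(tokens)])

x = features(["great", "movie"], {"great": 1.1, "movie": 0.0, "bad": -1.2})
# 3 sparse dims + (1 average + 3 POS groups) * 2 embedding dims = 11 features
```

The resulting vector would then be passed to a logistic regression classifier (e.g. scikit-learn's LogisticRegression).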

We use the same pre-processing steps as in [Kim2014] and the Google word2vec pre-trained word embeddings (https://code.google.com/archive/p/word2vec/) to compute the average vectors of the words in each group.
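As a rough illustration of the style of tokenization used in Kim (2014)'s preprocessing (splitting English clitics, padding punctuation with spaces, lowercasing), one might write the following. Treat this as an approximation, not the exact script:

```python
import re

def clean_str(s):
    """Approximate Kim (2014)-style tokenization: strip unusual characters,
    split English clitics, pad punctuation with spaces, and lowercase."""
    s = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", s)
    s = re.sub(r"'s", " 's", s)
    s = re.sub(r"'ve", " 've", s)
    s = re.sub(r"n't", " n't", s)
    s = re.sub(r"'re", " 're", s)
    s = re.sub(r"'d", " 'd", s)
    s = re.sub(r"'ll", " 'll", s)
    s = re.sub(r"([(),!?])", r" \1 ", s)
    s = re.sub(r"\s{2,}", " ", s)
    return s.strip().lower()

print(clean_str("It's great!"))   # → it 's great !
```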

Data                                  MR     CR     MPQA   Subj   TTwt   IMDB
NBSVM (Wang and Manning, 2012)        79.4   81.8   86.3   93.2   65.6   91.2
NBLR + POSwemb                        81.6   84.0   89.9   93.3   69.9   91.8
Weighted BoN (Li et al., 2016)        79.5   81.1   82.1   92.8   -      93.0
DSCNN (Zhang et al., 2016)            81.5   -      -      93.2   -      90.2
BLSTM-2DPooling (Zhou et al., 2016)   81.5   -      -      93.7   -      -
SA-LSTM (Dai and Le, 2015)            80.7   -      -      -      -      92.8
AdaSent (Zhao et al., 2015)           83.1   86.3   93.3   95.5   -      -
CNN non-static (Kim, 2014)            81.2   84.0   89.6   93.4   -      -
bowwvSVM (Zhang and Wallace, 2015)    79.67  81.3   89.7   91.7   -      -
SVM-dep (Jiang et al., 2011)          -      -      -      -      63.3   -
AdaRNN comb (Dong et al., 2014)       -      -      -      -      65.9   -
Target-dep+ (Vo and Zhang, 2015)      -      -      -      -      69.9   -
TC-LSTM (Tang et al., 2015)           -      -      -      -      69.5   -
Table 2: Results compared to other models. Our NBLR + POSwemb model performs best or second best on several datasets.

5 Results and Discussion

Results of our experiments are shown in Table 2. To report results, we use either 10-fold cross-validation or a train/test split, depending on what is standard for the dataset. The Testing column of Table 1 specifies which method is used. All results are reported in terms of accuracy, except for TTwt, where we report macro F-measure.

Compared to NBSVM, the NBLR + POSwemb model improves performance by 2-3%. Our simple model outperforms all other recent complex neural network models except AdaSent [Zhao et al.2015] and achieves near state-of-the-art performance on most benchmarks.

6 Conclusion

In this paper we have presented a new powerful model built on top of NBSVM. Using linear models with features derived from n-grams and embeddings, we are able to obtain near state-of-the-art results. These simple models provide a straightforward way for practitioners to create models and run experiments on new tasks.

In the future, we plan to experiment more with ways of combining word embeddings. Part-of-speech tags provide useful indicators of sentiment because separating nouns, verbs, and adjectives captures some high-level information about what the sentence is about in terms of subjects and objects. We may find that other word-level tags or constituent and dependency parse tree information is useful as well.

Models with Naive Bayes weighting, indicators, and word embeddings achieve near state-of-the-art scores on many sentiment benchmark datasets. Unlike other state-of-the-art models, our model is simple and fast to train compared to complex deep learning architectures, using only transformed uni/bi-grams and word embedding features.

References

  • [Dai and Le2015] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 3079–3087, Cambridge, MA, USA. MIT Press.
  • [Dong et al.2014] Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang, Ming Zhou, and Ke Xu. 2014. Adaptive recursive neural network for target-dependent twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 49–54, Baltimore, Maryland, June. Association for Computational Linguistics.
  • [Hu and Liu2004] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 168–177, New York, NY, USA. ACM.
  • [Jiang et al.2011] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 151–160, Portland, Oregon, USA, June. Association for Computational Linguistics.
  • [Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.
  • [Li et al.2016] Bofang Li, Zhe Zhao, Tao Liu, Puwei Wang, and Xiaoyong Du. 2016. Weighted neural bag-of-n-grams model: New baselines for text classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1591–1600, Osaka, Japan, December. The COLING 2016 Organizing Committee.
  • [Maas et al.2011] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June. Association for Computational Linguistics.
  • [Nakagawa et al.2010] Tetsuji Nakagawa, Kentaro Inui, and Sadao Kurohashi. 2010. Dependency tree-based sentiment classification using crfs with hidden variables. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 786–794, Los Angeles, California, June. Association for Computational Linguistics.
  • [Pang and Lee2004] Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 271–278, Barcelona, Spain, July.
  • [Pang and Lee2005] Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL, pages 115–124.
  • [Socher et al.2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October. Association for Computational Linguistics.
  • [Tang et al.2015] Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal, September. Association for Computational Linguistics.
  • [Vo and Zhang2015] Duy-Tin Vo and Yue Zhang. 2015. Target-dependent twitter sentiment classification with rich automatic features. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 1347–1353.
  • [Wang and Manning2012] Sida Wang and Christopher Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 90–94, Jeju Island, Korea, July. Association for Computational Linguistics.
  • [Wiebe and Cardie2005] Janyce Wiebe and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.
  • [Wilson et al.2005] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 347–354, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.
  • [Zhang and Wallace2015] Ye Zhang and Byron C. Wallace. 2015. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. CoRR, abs/1510.03820.
  • [Zhang et al.2016] Rui Zhang, Honglak Lee, and Dragomir R. Radev. 2016. Dependency sensitive convolutional neural networks for modeling sentences and documents. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1512–1521, San Diego, California, June. Association for Computational Linguistics.
  • [Zhao et al.2015] Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 4069–4076.
  • [Zhou et al.2016] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional lstm with two-dimensional max pooling. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3485–3495, Osaka, Japan, December. The COLING 2016 Organizing Committee.