A SentiWordNet Strategy for Curriculum Learning in Sentiment Analysis

05/10/2020
by Vijjini Anvesh Rao, et al.
IIIT Hyderabad

Curriculum Learning (CL) is the idea that training on a set sequenced or ordered from easy to difficult samples yields a performance gain over random ordering. The idea parallels the cognitive-science account of how humans learn: a difficult task can be made easier by phrasing it as a sequence of easy-to-difficult tasks. This idea has long held traction in machine learning and image processing, and has recently gained attention in Natural Language Processing (NLP). In this paper, we apply curriculum learning, driven by SentiWordNet, in a sentiment analysis setting: given a text segment, the aim is to extract its sentiment or polarity. SentiWordNet is a lexical resource with word-level sentiment polarity annotations. By comparing performance with other curriculum strategies and with no curriculum, we demonstrate the effectiveness of the proposed strategy. Convolutional, recurrent, and attention-based architectures are employed to assess this improvement. The models are evaluated on a standard sentiment dataset, the Stanford Sentiment Treebank.



1 Introduction

Researchers in Cognitive Science established long ago that humans learn better and more effectively in an incremental learning setting [27, 14]. Tasks like playing the piano or solving an equation are learnt in a strategy where easier variants of the main challenge are presented first, followed by gradual increases in difficulty. This idea of incremental human learning has been studied for machines as well, specifically in machine learning. Curriculum Learning (CL), as defined by [3], formulates this concept from Cognitive Science in a machine learning setting. They observe that on a shape recognition problem (rectangle, ellipse, or triangle), training the model first on a synthetically created dataset with less variability in shape generalizes faster than training directly on the target dataset. Further experiments by [3] demonstrate performance improvements on a perceptron classifier when incremental learning is based on the margin of a support vector machine (SVM), and on a language modelling task where growth in vocabulary size was chosen as the curriculum strategy. These examples indicate that while curriculum learning is effective, the choice of curriculum strategy, i.e. the basis for ordering samples, is not clear-cut and is often task specific. Furthermore, some recent works such as [21] have suggested that anti-curriculum strategies perform better, raising further doubts over the choice of strategy. In recent years, Self-Paced Learning (SPL) [15, 12] has been proposed as a reformulation of curriculum learning that models the curriculum strategy and the main task as a single optimization problem.

Sentiment Analysis (SA) is a major challenge in Natural Language Processing: it involves classifying text segments into two or more polarities or sentiments. Prior to the success of Deep Learning (DL), text classification was handled using lexicon-based features. However, sentiment information is realized at more levels than just the lexical or word level: for a model to assign a negative sentiment to "not good", it has to incorporate sequential information as well. Since the advent of DL, the field has been revolutionized. Long Short Term Memory networks (LSTM) [10, 20], Convolutional Neural Networks (CNN) [13, 19], and attention-based architectures [32] have achieved state-of-the-art results in text classification and remain strong baselines for text classification and, by extension, Sentiment Analysis. Sentiment Analysis further aids other areas of NLP such as opinion mining and emoji prediction [4].

Curriculum Learning has been explored extensively in Computer Vision (CV) [16, 11, 18] and has gained traction in Natural Language Processing (NLP) in tasks like Question Answering [25, 26] and Natural Answer Generation [17]. In Sentiment Analysis, [5] propose a strategy derived from sentence length, where shorter sentences are considered easier and are provided first. [9] provide a tree-structured curriculum based on the semantic similarity between new samples and the samples already trained on. [31] suggest a curriculum based on hand-crafted semantic, linguistic, and syntactic features for word representation learning. However, these CL strategies define the easiness or difficulty of a sample irrespective of its sentiment: while the strategies are used for sentiment analysis, they do not directly use sentiment-level information when ordering the samples. Using SentiWordNet, we can build strategies that derive from sentiment-level knowledge.

SentiWordNet [8, 1] is a highly popular word-level sentiment annotation resource that has been used in sentiment analysis and related fields such as opinion mining and emotion recognition. The resource was first created for English and, owing to its success, has been extended to many other languages [6, 7, 23, 24]. It assigns a positivity score, a negativity score, and, derived from the two, an objectivity score to each WordNet synset [22]. The contributions of this paper can be summarized as follows:

  • We propose a new curriculum strategy for sentiment analysis (SA), derived from SentiWordNet annotations.

  • Existing curriculum strategies for sentiment analysis rank samples with a difficulty score that is not pertinent to the task of SA. The proposed strategy ranks samples by how difficult it is to assign them a sentiment. Our results show the effectiveness of such a strategy over previous work.

2 Problem Setting for Curriculum Learning

While Curriculum Learning as defined by [3] is not constrained to a strict description, later related works [5, 9, 29] draw a distinction between a Baby Steps curriculum and a One-Pass curriculum. Since these works have also shown the dominance of Baby Steps over One-Pass, we use the former for the proposed SentiWordNet driven strategy. The Baby Steps curriculum can be defined as follows. For every sentence $s_i$, its sentiment is described as $y_i$ (our dataset has 5 labels), where $(s_i, y_i)$ are the data points in dataset $\mathcal{D}$. For a model $f_\theta$, its prediction based on $s_i$ is $f_\theta(s_i)$. A loss $\mathcal{L}$ is defined on the model prediction and the actual output as $\mathcal{L}(f_\theta(s_i), y_i)$, and the net cost over the dataset is $J(\theta) = \sum_{(s_i, y_i) \in \mathcal{D}} \mathcal{L}(f_\theta(s_i), y_i)$. The task is then modelled by

$$\min_\theta \; J(\theta) + R(\theta) \qquad (1)$$

where $R(\theta)$ can be a regularizer. In this setting, Curriculum Learning is defined by a curriculum strategy $\mathcal{C}$, where $\mathcal{C}(s_i)$ defines an "easiness" quotient of sample $s_i$. If the model is currently trained on a subset $\mathcal{D}' \subset \mathcal{D}$, the next sample $s^*$ is chosen based on $\mathcal{C}$ as:

$$s^* = \arg\min_{s_i \in \mathcal{D} \setminus \mathcal{D}'} \mathcal{C}(s_i) \qquad (2)$$

Sample $s^*$ is then added to the training set, $\mathcal{D}' \leftarrow \mathcal{D}' \cup \{s^*\}$. Since adding one sample at a time is very slow, we instead add samples in batches: at each step, the batch of $b$ samples with the lowest $\mathcal{C}$ is added at once, and the process continues until training has been done on all sentences in $\mathcal{D}$. The process starts by first training on the small subset of $\mathcal{D}$ with the least $\mathcal{C}$ scores. In this way, incremental learning is done in Baby Steps.
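
To make the procedure concrete, below is a minimal Python sketch of Baby Steps; it is an illustration under our own assumptions (a scikit-learn-style `fit(X, y)` interface and precomputed curriculum scores), not the authors' code.

```python
import numpy as np

def baby_steps(model, X, y, scores, batch_size):
    """Baby Steps curriculum (Section 2), as we read it.

    Assumptions (ours, for illustration): `model` exposes a
    scikit-learn-style fit(X, y); `scores[i]` holds the curriculum
    score C(s_i) of sample i, with lower meaning easier.
    """
    order = np.argsort(scores)                  # easiest samples first
    selected = []
    for start in range(0, len(order), batch_size):
        # Grow the training pool by the next-easiest batch (Eq. 2)...
        selected.extend(order[start:start + batch_size])
        # ...then train on everything accumulated so far.
        model.fit(X[selected], y[selected])
    return model
```

With `scores` set to sentence lengths this reproduces the Sentence Length strategy of Section 3.4.2; with the auxiliary-model scores of Section 3.4.1 it gives the proposed SentiWordNet strategy.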

3 Experiments

3.1 Dataset

Following previous works in curriculum-driven sentiment analysis [5, 9, 31], we use the Stanford Sentiment Treebank (SST) dataset [28] (https://nlp.stanford.edu/sentiment/). Unlike most sentiment analysis datasets with binary labels, SST is a 5-class text classification dataset, consisting of 8544/1101/2210 samples in the train, development, and test sets respectively. We use this standard split, with reported results averaged over 10 runs.

Model              SentiWordNet   Sentence Length   No Curriculum
Kim CNN                41.55            40.81            40.59
LSTM                   44.54            43.89            41.71
LSTM+Attention         45.27            42.98            41.66

Table 1: Accuracy scores (in percent) of all models under the different curriculum strategies.

3.2 Architectures

We test our curriculum strategies on popular recurrent and convolutional architectures used for text classification. It is important to note that curriculum strategies are independent of the architecture: they only decide the order in which samples are presented during training; the training itself can be done with any algorithm.

Kim CNN

This baseline is based on the deep CNN architecture of [13], which is highly popular for text classification.
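
As an illustration only, here is a minimal PyTorch sketch of a Kim-style CNN under our assumptions; the filter settings follow Section 3.3 (50 filters of widths 3, 4, 5), while the vocabulary size is a placeholder.

```python
import torch
import torch.nn as nn

class KimCNN(nn.Module):
    """Sketch of the Kim (2014) CNN baseline [13]: parallel 1-D
    convolutions over word embeddings, max-over-time pooling, and a
    linear classifier (softmax is applied inside the loss)."""

    def __init__(self, vocab_size=20000, emb_dim=300,
                 n_filters=50, widths=(3, 4, 5), num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.out = nn.Linear(n_filters * len(widths), num_classes)

    def forward(self, tokens):                      # tokens: (batch, time)
        x = self.embed(tokens).transpose(1, 2)      # (batch, emb_dim, time)
        pooled = [conv(x).relu().max(dim=2).values  # max over time
                  for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))   # logits for 5 classes
```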

LSTM

We employ a Long Short Term Memory network (LSTM) [10] for text classification. Softmax activation is applied to the final timestep of the LSTM to obtain the output probability distribution. A previous approach [5] also uses an LSTM for this task, with sentence length as the curriculum strategy.

LSTM + Attention

In this architecture we apply the attention mechanism described in [2] on top of the LSTM outputs across timesteps to obtain a single context vector, on which softmax is applied. The attention mechanism focuses on the parts of the sentence that contribute most to the sentiment, in particular sentiment-bearing words.
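
The following PyTorch sketch shows one plausible reading of this baseline; the paper does not spell out the exact attention equations, so we assume additive scoring in the style of [2], with the 168 LSTM units and 10-unit attention sub-network reported in Section 3.3.

```python
import torch
import torch.nn as nn

class LSTMAttention(nn.Module):
    """Sketch of the LSTM + Attention baseline (our assumptions):
    an LSTM over the sentence, additive attention over its outputs,
    and a classifier on the resulting context vector."""

    def __init__(self, vocab_size=20000, emb_dim=300, hidden=168,
                 attn_units=10, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.score = nn.Sequential(                # additive attention scorer
            nn.Linear(hidden, attn_units), nn.Tanh(),
            nn.Linear(attn_units, 1))
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, tokens):                     # tokens: (batch, time)
        h, _ = self.lstm(self.embed(tokens))       # h: (batch, time, hidden)
        weights = torch.softmax(self.score(h), 1)  # attention over timesteps
        context = (weights * h).sum(dim=1)         # single context vector
        return self.out(context)                   # logits; softmax in the loss
```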

3.3 Implementation Details

We use GloVe pretrained word vectors (https://nlp.stanford.edu/data/glove.840B.300d.zip) for the input embeddings of all architectures; the word embeddings have size 300. A maximum sentence length of 50 is used for all architectures. The CNN model uses 50 filters with filter sizes 3, 4, and 5. Following the empirical setup of previous work [30], the number of LSTM units is 168. For the LSTM + Attention model, the attention sub-network has 10 units. Categorical cross-entropy is used as the loss function and Adam with a learning rate of 0.01 as the optimizer. The batch size $b$ defined in the curriculum learning framework is chosen per strategy and per architecture: one value for the Sentence Length strategy with LSTM and LSTM+Attention and another with the CNN, and likewise for the SentiWordNet strategy.
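
Putting these details together, a minimal sketch of the training setup as we read it (the model is a stand-in classifier and the vocabulary size a placeholder; PyTorch's CrossEntropyLoss applies the softmax internally):

```python
import torch
import torch.nn as nn

# Stand-in classifier: mean-pooled embeddings over a linear layer.
model = nn.Sequential(nn.EmbeddingBag(20000, 300), nn.Linear(300, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam, lr 0.01
criterion = nn.CrossEntropyLoss()                          # categorical cross-entropy

def train_step(tokens, labels):
    """One optimization step on a batch of padded token-id sequences."""
    optimizer.zero_grad()
    loss = criterion(model(tokens), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```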

3.4 Curriculum Strategies

In this section we present the proposed SentiWordNet driven strategy, followed by a common strategy based on sentence length.

3.4.1 SentiWordNet driven Strategy

We first train an auxiliary feed-forward model for sentiment analysis on the same dataset, using only SentiWordNet features. This lets us find out which training samples are actually difficult. The features we use for the auxiliary model are as follows:

  • Sentence Length ($l$): For a given sentence, this feature is just the number of words after tokenization.

  • Net Positivity Score ($P$): For a given sentence, this is the sum of the positivity scores of its individual words.

  • Net Negativity Score ($N$): For a given sentence, this is the sum of the negativity scores of its individual words.

  • Net Objectivity Score ($O$): For a given sentence, this is the sum of the objectivity scores of its individual words. Note that for an individual word, the objectivity score is just 1 - positivity score - negativity score. This feature is meant to reflect how difficult it is to tell the sentiment of a sentence.

  • Abs. Difference Score ($A$): The absolute difference between the Net Positivity and Net Negativity scores, $A = |P - N|$. This feature is meant to reflect the overall sentiment of the sentence.

  • Scaled Positivity: Since Net Positivity may grow with the number of words, we also provide the scaled-down feature $P/l$.

  • Scaled Negativity: For the same reason as above, we also provide $N/l$.

  • Scaled Objectivity: Objectivity scaled down by sentence length, $O/l$.

  • Scaled Abs. Difference: Abs. Difference scaled down by sentence length, $A/l$.

Since the features lie in very different ranges, they are first normalized and mean-centered before being passed to the architectures for training. It is also important to note that SentiWordNet scores are assigned per synset, not per word, so a single word may have multiple scores; in such cases the positivity, negativity, and objectivity scores are averaged for that word. We train the auxiliary model as a simple feed-forward network whose final layer is a softmax layer. Its accuracy is significantly lower than the performances of the LSTM and CNN in the No Curriculum setting, as seen in Table 1, but this performance does not actually matter: from this model we learn which samples are the most difficult to classify and which are the easiest. For all 8544 training samples of SST, we define the curriculum score as follows:

$$\mathcal{C}(s_i) = \frac{1}{K} \sum_{j=1}^{K} \left( \hat{y}_{ij} - y_{ij} \right)^2 \qquad (3)$$

where $\hat{y}_i$ is the prediction of the auxiliary model on sentence $s_i$, $j$ iterates over the $K = 5$ classes, and $y_i$ is the one-hot encoding of the true label. In essence, we compute the mean squared error between the prediction and the sentence's true label. If $\mathcal{C}(s_i)$ is high, the sentence is hard to classify; if it is low, the sentence is easy. Because the auxiliary model is trained on SentiWordNet features alone, this yields an easiness-difficulty score purely from the perspective of sentiment analysis.
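
For illustration, here is a sketch of how the SentiWordNet features and the curriculum score of Eq. (3) could be computed with NLTK's SentiWordNet interface; the function names are ours, and sentences are assumed non-empty.

```python
import numpy as np
from nltk.corpus import sentiwordnet as swn  # needs the 'sentiwordnet' and 'wordnet' corpora

def word_scores(word):
    """SentiWordNet scores for a word. Scores attach to synsets, so a
    word with several synsets gets its scores averaged (Section 3.4.1)."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0, 0.0, 1.0                 # unseen words treated as fully objective
    pos = float(np.mean([s.pos_score() for s in synsets]))
    neg = float(np.mean([s.neg_score() for s in synsets]))
    return pos, neg, 1.0 - pos - neg         # objectivity = 1 - pos - neg

def sentence_features(tokens):
    """The nine auxiliary-model features: l, P, N, O, A and their scaled forms."""
    l = len(tokens)
    P = N = O = 0.0
    for t in tokens:
        p, n, o = word_scores(t)
        P, N, O = P + p, N + n, O + o
    A = abs(P - N)                           # Abs. Difference Score
    return [l, P, N, O, A, P / l, N / l, O / l, A / l]

def curriculum_score(aux_probs, onehot_label):
    """Eq. (3): mean squared error between the auxiliary model's
    predicted distribution and the one-hot label; higher = harder."""
    diff = np.asarray(aux_probs) - np.asarray(onehot_label)
    return float(np.mean(diff ** 2))
```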

3.4.2 Sentence Length

This simple strategy posits that architectures, LSTMs in particular, find longer sentences harder to classify; hence longer sentences are considered difficult and are ordered later, while shorter sentences are easier and are trained on first. The strategy is very common and has been used not only in sentiment analysis [5] (note that [5] also apply CL on SST, but our numbers do not match theirs because they use the much larger phrase-level dataset) but also in dependency parsing [29]. This makes it a strong baseline against which to evaluate the SentiWordNet driven strategy.

4 Results

We report our results in Table 1. As evident from the table, the proposed SentiWordNet based strategy consistently beats both the Sentence Length strategy and No Curriculum. The margin is small for the CNN model, however. A likely reason is that the CNN is the weakest of the models without a curriculum: an architecture that already struggles to classify properly can hardly exploit a curriculum strategy for better generalization. The effectiveness of the Sentence Length strategy for LSTM and LSTM+Attention also has a natural explanation: since the LSTM observes one word at a time, longer sequences are harder to remember, so ordering by sentence length acts as a good curriculum basis, an observation also made in previous works such as [5]. Since the attention mechanism observes all timesteps of the LSTM, the difficulty posed by longer sentences diminishes, and the improvement from the Sentence Length strategy is accordingly smaller for LSTM+Attention than for LSTM. While the Sentence Length strategy performs well on the LSTM and LSTM+Attention models, it still trails SentiWordNet, because sentence length defines difficulty and easiness in a global, task-agnostic way. With SentiWordNet, by contrast, we define a strategy that captures the core of curriculum learning for sentiment analysis: ranking samples solely by how difficult or easy it is to classify them into the predefined sentiment categories.

5 Conclusion

In this paper, we define a SentiWordNet driven strategy for curriculum learning on the sentiment analysis task. The proposed approach improves performance across multiple architectures, namely recurrent, convolutional, and attention based, demonstrating the robustness of the strategy. The approach also shows the effectiveness of simple lexicon-level annotations such as SentiWordNet and how they can be used to further sentiment analysis. Future work could include strategies that consecutively enrich SentiWordNet, as well as strategies that refine the resource by pointing out anomalies in its annotation.

References

  • [1] S. Baccianella, A. Esuli, and F. Sebastiani (2010) SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In LREC, Vol. 10, pp. 2200–2204. Cited by: §1.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.2.
  • [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §1, §2.
  • [4] N. Choudhary, R. Singh, V. A. Rao, and M. Shrivastava (2018) Twitter corpus of resource-scarce languages for sentiment analysis and multilingual emoji prediction. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1570–1577. Cited by: §1.
  • [5] V. Cirik, E. Hovy, and L. Morency (2016) Visualizing and understanding curriculum learning for long short-term memory networks. arXiv preprint arXiv:1611.06204. Cited by: §1, §2, §3.1, §3.2, §3.4.2, §4, footnote **.
  • [6] A. Das and S. Bandyopadhyay (2010) SentiWordNet for indian languages. In Proceedings of the Eighth Workshop on Asian Language Resouces, pp. 56–63. Cited by: §1.
  • [7] A. Das and S. Bandyopadhyay (2010) Towards the global sentiwordnet. In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, pp. 799–808. Cited by: §1.
  • [8] A. Esuli and F. Sebastiani (2006) Sentiwordnet: a publicly available lexical resource for opinion mining.. In LREC, Vol. 6, pp. 417–422. Cited by: §1.
  • [9] S. Han and S. Myaeng (2017) Tree-structured curriculum learning based on semantic similarity of text. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 971–976. Cited by: §1, §2, §3.1.
  • [10] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, §3.2.
  • [11] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann (2014) Easy samples first: self-paced reranking for zero-example multimedia search. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 547–556. Cited by: §1.
  • [12] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann (2015) Self-paced curriculum learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Cited by: §1.
  • [13] Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §1, §3.2.
  • [14] K. A. Krueger and P. Dayan (2009) Flexible shaping: how learning in small steps helps. Cognition 110 (3), pp. 380–394. Cited by: §1.
  • [15] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197. Cited by: §1.
  • [16] Y. J. Lee and K. Grauman (2011) Learning the easy things first: self-paced visual category discovery. In CVPR 2011, pp. 1721–1728. Cited by: §1.
  • [17] C. Liu, S. He, K. Liu, and J. Zhao (2018) Curriculum learning for natural answer generation.. In IJCAI, pp. 4223–4229. Cited by: §1.
  • [18] J. Louradour and C. Kermorvant (2014) Curriculum learning for handwritten text line recognition. In 2014 11th IAPR International Workshop on Document Analysis Systems, pp. 56–60. Cited by: §1.
  • [19] A. Madasu and V. A. Rao (2019) Gated convolutional neural networks for domain adaptation. In International Conference on Applications of Natural Language to Information Systems, pp. 118–130. Cited by: §1.
  • [20] A. Madasu and V. A. Rao (2019) Sequential learning of convolutional features for effective text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5662–5671. Cited by: §1.
  • [21] B. McCann, N. S. Keskar, C. Xiong, and R. Socher (2018) The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730. Cited by: §1.
  • [22] G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1.
  • [23] S. Parupalli, V. A. Rao, and R. Mamidi (2018) Bcsat: a benchmark corpus for sentiment analysis in telugu using word-level annotations. arXiv preprint arXiv:1807.01679. Cited by: §1.
  • [24] S. Parupalli, V. A. Rao, and R. Mamidi (2018) Towards enhancing lexical resource and using sense-annotations of ontosensenet for sentiment analysis. arXiv preprint arXiv:1807.03004. Cited by: §1.
  • [25] M. Sachan and E. Xing (2016) Easy questions first? a case study on curriculum learning for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 453–463. Cited by: §1.
  • [26] M. Sachan and E. Xing (2018) Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 629–640. Cited by: §1.
  • [27] B. F. Skinner (1958) Reinforcement today.. American Psychologist 13 (3), pp. 94. Cited by: §1.
  • [28] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §3.1.
  • [29] V. I. Spitkovsky, H. Alshawi, and D. Jurafsky (2010) From baby steps to leapfrog: how less is more in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 751–759. Cited by: §2, §3.4.2.
  • [30] K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. Cited by: §3.3.
  • [31] Y. Tsvetkov, M. Faruqui, W. Ling, B. MacWhinney, and C. Dyer (2016) Learning the curriculum with bayesian optimization for task-specific word representation learning. arXiv preprint arXiv:1605.03852. Cited by: §1, §3.1.
  • [32] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489. Cited by: §1.