Deep Learning (DL) has radically changed the rules of the game in NLP by boosting performance figures in almost all application areas. Yet in contrast to more conventional techniques, such as n-gram-based linear models, neural methodologies seem to rely on vast amounts of training data, as is obvious in areas such as machine translation or word representation learning (Vaswani et al., 2017; Mikolov et al., 2013).
With this profile, DL seems ill-suited for many prediction tasks in Sentiment and Subjectivity Analysis (Balahur et al., 2014). For the widely studied problem of polarity prediction in social media (positive vs. negative emotion or evaluation only; Rosenthal et al., 2017), training data is relatively abundant. However, annotating for more complex representations of affective states, such as Basic Emotions (Ekman, 1992) or Valence-Arousal-Dominance (Bradley and Lang, 1994), seems to be significantly harder in terms of both time consumption and inter-annotator agreement (IAA) (Strapparava and Mihalcea, 2007). Nevertheless, these more complex models of emotion have rapidly gained popularity in recent years due to their increased expressiveness (Wang et al., 2016; Buechel and Hahn, 2017; Sedoc et al., 2017).
For the social media domain, this lack of gold data can be partly countered by (pre-)training with distant supervision, which uses signals such as emojis or hashtags as a surrogate for manual annotation (Mohammad and Kiritchenko, 2015; Felbo et al., 2017; Abdul-Mageed and Ungar, 2017). Yet this procedure is less appropriate for other target domains as well as for predicting other subjective phenomena such as empathy, epistemic modality, or personality (Khanpour et al., 2017; Rubin, 2007; Liu et al., 2017). These problems only intensify for under-resourced languages.
Besides pre-training the entirety of the model with distant supervision, an alternative strategy is to pre-train only the word representations. This approach is feasible for a wide range of languages since raw text is much more readily available than gold data, e.g., through Wikipedia (Grave et al., 2018). Unfortunately, it has been frequently argued that pre-trained embeddings are ill-suited for sentiment and emotion analysis since they do not capture sufficient affective information. This has been illustrated by word pairs like good and bad, which have highly similar vector representations but opposing polarity (Tang et al., 2014; Yu et al., 2017; Khosla et al., 2018). However, to the best of our knowledge, no experimental data have been provided in support of this claim.
Contribution. Both claims, the need for large amounts of gold data and the lack of affective information in pre-trained word embeddings, may largely impede the feasibility of DL in low-resource scenarios. Yet, in this paper, we provide strong, first-time evidence that both turn out to be misconceptions. Our experimental results from three typologically diverse languages indicate that sophisticated DL architectures can be fitted on surprisingly little gold data and that pre-trained word embeddings are instrumental for achieving strong performance despite such data constraints. This contribution thus opens up new application areas for DL, especially in under-resourced languages and other low-resource environments, given that sufficient unlabeled data are available. Our best performing model achieves super-human, state-of-the-art results on the popular SemEval 2007 corpus (Strapparava and Mihalcea, 2007).
|SE07||Inter Milan set Serie A win record|
|ANET||The dog strains forward, snarling, and suddenly leaps out at you.|
|ANPST||Decyzje podjęte w przeszłości kształtują naszą teraźniejszość. ‘Decisions made in the past shape our present.’|
|MAS||A praia é espetacular. ‘The beach is spectacular.’|
|Corpus||Language||Size||Annotation||Emb. Data||Emb. Alg.||Emb. Size|
|MAS||Portuguese||192||VAD + BE5||Wikipedia||FastText||4B|
For our study, we selected corpora of small size (at most 1000 instances) where each instance bears numerical ratings regarding multiple emotion variables. (The latter restriction was established to permit multi-task learning (Section 3) and is also the reason why the corpus by Mohammad and Bravo-Marquez (2017b) was not included.) According to these criteria, we came up with the following four data sets covering three typologically diverse languages (exemplary entries in Table 1).
SE07: The test set of SemEval 2007 Task 14 (Strapparava and Mihalcea, 2007) comprises 1000 English news headlines which are annotated according to six Basic Emotions (joy, anger, sadness, fear, disgust, and surprise) on a [0, 100]-scale (BE6 annotation format).
ANET: The Affective Norms for English Text (Bradley and Lang, 2010) are an adaptation of the popular lexical database ANEW (Bradley and Lang, 1999) to short texts. The corpus comprises 120 situation descriptions which are annotated according to Valence, Arousal, and Dominance on a 9-point scale (VAD annotation format).
ANPST and MAS: The Affective Norms of Polish Short Texts (Imbir, 2017) and the Minho Affective Sentences (Pinheiro et al., 2017) can be seen as loose adaptations of ANET, very similar in methodology but different in size and linguistic characteristics (see Table 1). Both are annotated according to VAD. Additionally, MAS is annotated according to the first five Basic Emotions (omitting ‘surprise’) on a 5-point scale (BE5).
To increase both performance and reproducibility, we employ pre-trained, publicly available word embeddings. We rely mostly on FastText vectors (Bojanowski et al., 2017), yet for SE07 we use the word2vec embeddings (Mikolov et al., 2013; available at code.google.com/archive/p/word2vec/), which were trained on data similar to SE07 (newswire material). For ANET, we rely on the FastText embeddings trained on Common Crawl (Mikolov et al., 2018). For ANPST and MAS, we use the FastText embeddings by Grave et al. (2018) trained on the respective Wikipedias. An overview of our corpora and embedding models is given in Table 2.
We provide two distinct linear baseline models which both rely on Ridge regression, an L2-regularized version of linear regression. The first one, Ridge_ngram, is based on n-gram features. The second one, Ridge_BV, uses bag-of-vectors features, i.e., the pointwise mean of the embeddings of the words in a text. Regarding the deep learning approaches, we compare Feed-Forward Networks (FFN), Gated Recurrent Unit Networks (GRU), Long Short-Term Memory Networks (LSTM), Convolutional Neural Networks (CNN), as well as a combination of the latter two (CNN-LSTM) (Cho et al., 2014; Hochreiter and Schmidhuber, 1997; Kalchbrenner et al., 2014).
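The bag-of-vectors representation can be illustrated with a minimal sketch. The function name `bag_of_vectors`, the toy embedding table, and the handling of texts without any known word (mapped to a zero vector) are assumptions made for illustration and are not taken from our implementation.

```python
from typing import Dict, List, Sequence


def bag_of_vectors(tokens: Sequence[str],
                   embeddings: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Return the pointwise mean of the embeddings of all known tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption: a text without any in-vocabulary word maps to zeros.
        return [0.0] * dim
    # zip(*vectors) iterates over dimensions, so each mean is per component.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical 3-dimensional embeddings; real models use 300 dimensions.
toy_embeddings = {
    "beach": [1.0, 0.0, 2.0],
    "spectacular": [3.0, 2.0, 0.0],
}

features = bag_of_vectors(["the", "beach", "is", "spectacular"],
                          toy_embeddings, dim=3)
print(features)  # [2.0, 1.0, 1.0]
```

The resulting fixed-length vector can serve as input to any model that expects a flat feature vector, such as a linear regressor or a feed-forward network.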
Since holding out a dev set from the already extremely limited training data is not feasible, we decided to instead use constant hyperparameter settings across all corpora, thus also demonstrating the robustness of our models (see Section 4). Moreover, a large number of hyperparameters are even held constant across different model architectures. These universal settings are as follows:
|Ridge_ngram||n-gram features; feature normalization; automatically chosen regularization coefficient|
|Ridge_BV||bag-of-vectors features; regularization coefficient chosen as in ‘Ridge_ngram’|
|FFN||bag-of-vectors features; two dense layers (256 and 128 units)|
|CNN||one conv. layer (filter size 3, 128 channels); max-pooling layer with .5 dropout; dense layer (128 units)|
|GRU||recurrent layer (128 units, uni-directional); last timestep receives .5 vertical dropout and is fed into a dense layer (128 units)|
|LSTM||identical to ‘GRU’|
|CNN-LSTM||conv. layer as in ‘CNN’; max-pooling layer (pool size 2, stride size 1) with .5 dropout; LSTM identical to ‘GRU’|
The input to our DL models is based on pre-trained word vectors of 300 dimensions. ReLU activation was used everywhere except in recurrent layers. Dropout is used for regularization with a probability of .2 for embedding layers and .5 for dense layers, following the recommendations by Srivastava et al. (2014). We also use .5 dropout on other types of layers where it would conventionally be considered too high (e.g., on max-pooling layers). Our models are trained for 200 epochs using the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate and a batch size of 32. (Training each of the individual models took about a minute, at most, on a GeForce GTX 1080 Ti.) Word embeddings were not updated during training. Since, in compliance with our gold data, we treat emotion analysis as a regression problem (Buechel and Hahn, 2016), the output layers of our models consist of an affine transformation, i.e., a dense layer without non-linearity.
To reduce the risk of overfitting on such small data sets, we used relatively simple models, both in terms of the number of layers and the number of units in them (mostly 2 and 128, respectively). Moreover, our models have one distinct output neuron for each variable of the respective annotation format (e.g., 3 for VAD). Yet the weights and biases of all hidden layers are shared across the outputs. Arguably, this set-up qualifies as a mild form of multi-task learning (Caruana, 1997), a machine learning technique which has been shown to greatly decrease the risk of overfitting (Baxter, 1997) and to work well for various NLP tasks (Søgaard and Goldberg, 2016; Peng et al., 2017).
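The output set-up can be sketched as a forward pass in plain Python: one shared hidden layer with ReLU, followed by an affine output layer with one neuron per emotion variable. All names and the tiny weight matrices below are hypothetical and chosen only so that the arithmetic is easy to follow; the actual models use larger layers and a deep learning framework.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Affine transformation with one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def predict(features: Vector, hidden_w: Matrix, hidden_b: Vector,
            out_w: Matrix, out_b: Vector) -> Vector:
    # The hidden representation is computed once and shared by all outputs.
    hidden = relu(dense(features, hidden_w, hidden_b))
    # Regression head: affine only, no non-linearity, one neuron per variable.
    return dense(hidden, out_w, out_b)


# Hypothetical toy parameters: 2 input features, 2 hidden units, 3 outputs
# (e.g., Valence, Arousal, Dominance).
hidden_w = [[1.0, 0.0], [0.0, 1.0]]
hidden_b = [0.0, 0.0]
out_w = [[0.5, 0.0], [1.0, 1.0], [-2.0, 0.0]]
out_b = [0.0, 0.5, 1.0]

vad = predict([1.0, -1.0], hidden_w, hidden_b, out_w, out_b)
print(vad)  # [0.5, 1.5, -1.0]
```

Because every output neuron reads from the same hidden vector, the error signal of each emotion variable updates the shared weights during training, which is what makes the set-up a form of multi-task learning.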
Performance will be measured as Pearson correlation between the predicted values and human gold ratings (one r-value per variable of the target representation, often averaged over all of them).
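As a minimal sketch of this evaluation measure, the following computes Pearson's r separately for each variable and then averages the values. The helper names and the toy ratings are assumptions for illustration only.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def per_variable_r(gold: List[List[float]],
                   pred: List[List[float]]) -> Tuple[List[float], float]:
    """Rows are instances, columns are emotion variables."""
    # zip(*rows) transposes the data so that each item is one variable.
    r_values = [pearson_r(g, p) for g, p in zip(zip(*gold), zip(*pred))]
    return r_values, sum(r_values) / len(r_values)


# Hypothetical ratings for three instances and two variables.
gold = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
pred = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]

r_values, mean_r = per_variable_r(gold, pred)
print(r_values, mean_r)  # [1.0, -1.0] 0.0
```

Note that Pearson's r is invariant to the scale of the predictions, so the first variable reaches r = 1 even though the predicted values are twice the gold values.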
Conventional 10-fold cross-validation (CV) would lead to very small test splits (only 12 instances in the case of ANET), thus causing high variance between the individual splits and, ultimately, even regarding the average of all 10 runs. Therefore, we repeat 10-fold CV ten times (10×10-CV) with different data splits and then average the results (Dietterich, 1998). To further increase reliability, identical data splits were used for each of the approaches under comparison.
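The repeated cross-validation scheme can be sketched as follows. The generator name `repeated_kfold` and the use of a fixed seed to obtain identical splits for all models are illustrative assumptions; libraries such as scikit-learn offer comparable functionality.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n_items: int, k: int = 10, repeats: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)  # fixed seed: every model sees the same splits
    for _ in range(repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)  # a fresh partition in every repetition
        for fold in range(k):
            test = indices[fold::k]  # every k-th item forms one fold
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


# 120 instances, as in ANET: 100 splits with 12 test instances each.
splits = list(repeated_kfold(120))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 100 108 12
```

Materializing the splits once and passing the same list to every model is one simple way to ensure that all approaches are compared on identical data.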
We treat the VAD and the BE5 ratings of the MAS corpus as two different data sets (MAS_VAD and MAS_BE5), leading to a total of five conditions (see Table 4). Overall, the DL approaches yield satisfactory average performance over all corpora, despite the small data size. All of them massively outperform the n-gram-based Ridge baseline (Ridge_ngram), which represents more conventional methodologies popular before the wide adoption of embedding- and DL-based approaches. The results are especially good for the GRU, LSTM, CNN-LSTM, and FFN. Overall, the GRU performs best, being superior in all but one condition, where the FFN comes out on top. Perhaps surprisingly, the bag-of-vectors Ridge baseline (Ridge_BV) also performs very competitively. Given its low computational cost and its robustness across data sets, our results indicate that this model constitutes an excellent baseline. This also suggests that the high quality of the pre-trained embedding models may be one of the key factors for our generally very strong results because Ridge_BV relies heavily on lexical signals. In line with that, we found in a supplemental experiment that not using pre-trained embeddings but instead learning them during training significantly reduces performance, e.g., by over 15 percentage points for the GRU on SE07.
We now compare our best performing model against previously reported results for the SE07 corpus. Table 5 provides the performance of the winning system of the original shared task (Winner; Chaumartin, 2007), the IAA as reported by the organizers (Strapparava and Mihalcea, 2007), the performance reported by Beck (2017), which is the highest one reported for this data set so far (Beck), as well as the results of our GRU from the 10×10-CV.
As can be seen, the GRU establishes a new state-of-the-art result and even achieves super-human performance. This may sound improbable at first glance. However, Strapparava and Mihalcea (2007) employ a rather weak notion of human performance which is, broadly speaking, based on the reliability of a single human rater. Other approaches to IAA computation for numerical values, such as split-half or inter-study reliability, instead constitute a more challenging comparison since they are based on the reliability of many raters, not one (Mohammad and Bravo-Marquez, 2017a; Buechel and Hahn, 2018).
Interestingly, the GRU shows particularly large improvements over human performance for categories where the IAA is low (anger, disgust, and surprise) which might be an effect of the additional supervision introduced by multi-task learning.
Training Size vs. Model Performance.
In our last analysis, again focusing on the SE07 corpus, we examine the behavior of our full set of models when varying the amount of training data. For each training size N, we randomly sampled N instances from the entire corpus for training and tested on the held-out data. This procedure was repeated 100 times for each of the training data sizes before averaging the results. Each of the models was evaluated with identical data splits. The outcome of this experiment is depicted in Figure 1.
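The sampling procedure can be sketched as follows. The function names, the example training sizes (100 and 200), and the placeholder scoring function `toy_score` are assumptions for illustration; in practice, the scoring function would train a model on the training indices and return its correlation on the held-out instances.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]


def sample_splits(n_items: int, train_size: int, repeats: int = 100,
                  seed: int = 0) -> List[Split]:
    """Draw `repeats` random train/test partitions with a fixed train size."""
    rng = random.Random(seed)  # fixed seed: identical splits for all models
    splits = []
    for _ in range(repeats):
        train = sorted(rng.sample(range(n_items), train_size))
        train_set = set(train)
        test = [i for i in range(n_items) if i not in train_set]
        splits.append((train, test))
    return splits


def learning_curve(sizes: Sequence[int], n_items: int,
                   train_and_score: Callable[[List[int], List[int]], float],
                   repeats: int = 100) -> Dict[int, float]:
    """Average the score over all repetitions for each training size."""
    curve = {}
    for size in sizes:
        scores = [train_and_score(train, test)
                  for train, test in sample_splits(n_items, size, repeats)]
        curve[size] = sum(scores) / len(scores)
    return curve


def toy_score(train: List[int], test: List[int]) -> float:
    # Placeholder only: returns the training fraction instead of a real score.
    return len(train) / (len(train) + len(test))


# 1000 instances, as in SE07; the training sizes here are hypothetical.
curve = learning_curve([100, 200], n_items=1000, train_and_score=toy_score)
```

Since the seed is fixed, calling `sample_splits` with the same arguments always returns the same partitions, so different models can be evaluated on identical data.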
As can be seen, recurrent models suffer only a moderate loss of performance down to a third of the original training data (about 300 observations). The CNN, the FFN, and the bag-of-vectors Ridge model (Ridge_BV) remain stable even longer: their performance only begins to decline rapidly at about 100 instances. Astonishingly, the CNN achieves human performance even with as few as 200 training samples. In contrast, the n-gram-based Ridge model (Ridge_ngram) declines more steadily, yet its overall performance on larger training sets is much lower.
We provided the first examination of DL for emotion analysis under extreme data limitations. We compared popular architectures such as GRU and CNN-LSTM on four typologically diverse data sets with sizes ranging between 1000 and only 120 instances. Counterintuitively, we found that all DL approaches performed well under every experimental condition. Our proposed GRU model even established a novel state-of-the-art result on the SemEval 2007 test set (Strapparava and Mihalcea, 2007), outperforming human reliability. Moreover, it has been frequently argued that pre-trained word embeddings do not comprise sufficient affective information to be used verbatim in emotion analysis. We here provided evidence that the opposite holds: high-quality pre-trained word embeddings are instrumental in achieving strong results in low-resource scenarios and largely boost performance independent of model type. Hence, this contribution pointed out two obstructive misconceptions, thus opening up DL for applications in low-resource scenarios.
- Abdul-Mageed and Ungar (2017) Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-Grained Emotion Detection with Gated Recurrent Neural Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.
- Balahur et al. (2014) Alexandra Balahur, Rada Mihalcea, and Andrés Montoyo. 2014. Computational approaches to subjectivity and sentiment analysis: Present and envisaged methods and applications. Computer Speech & Language, 28(1):1–6.
- Baxter (1997) Jonathan Baxter. 1997. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39.
- Beck (2017) Daniel Beck. 2017. Modelling representation noise in emotion analysis using Gaussian processes. In Proceedings of the 8th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 140–145.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomáš Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146.
- Bradley and Lang (1994) Margaret M. Bradley and Peter J. Lang. 1994. Measuring emotion: The Self-Assessment Manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25(1):49–59.
- Bradley and Lang (1999) Margaret M. Bradley and Peter J. Lang. 1999. Affective Norms for English Words (ANEW): Stimuli, instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida, Gainesville, FL.
- Bradley and Lang (2010) Margaret M. Bradley and Peter J. Lang. 2010. Affective Norms for English Words (ANEW): Stimuli, instruction manual and affective ratings. Technical Report C-2, University of Florida, Gainesville, FL.
- Buechel and Hahn (2016) Sven Buechel and Udo Hahn. 2016. Emotion analysis as a regression problem: Dimensional models and their implications on emotion representation and metrical evaluation. In Proceedings of the 22nd European Conference on Artificial Intelligence, pages 1114–1122.
- Buechel and Hahn (2017) Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In EACL 2017 — Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2, short papers, pages 578–585, Valencia, Spain, April 3–7, 2017.
- Buechel and Hahn (2018) Sven Buechel and Udo Hahn. 2018. Representation Mapping: A Novel Approach to Generate High-Quality Multi-Lingual Emotion Lexicons. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 184–191.
- Caruana (1997) Rich Caruana. 1997. Multitask Learning. Machine Learning, 28(1):41–75.
- Chaumartin (2007) François-Régis Chaumartin. 2007. UPAR7: A knowledge-based system for headline sentiment tagging. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 422–425.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, 8th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.
- Dietterich (1998) Thomas G Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923.
- Ekman (1992) Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.
- Felbo et al. (2017) Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615–1625.
- Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 3483–3487.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Imbir (2017) Kamil K Imbir. 2017. The Affective Norms for Polish Short Texts (ANPST) Database Properties and Impact of Participants’ Population and Sex on Affective Ratings. Frontiers in Psychology, 8.
- Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 655–665.
- Khanpour et al. (2017) Hamed Khanpour, Cornelia Caragea, and Prakhar Biyani. 2017. Identifying Empathetic Messages in Online Health Communities. In Proceedings of the 8th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 246–251.
- Khosla et al. (2018) Sopan Khosla, Niyati Chhaya, and Kushal Chawla. 2018. Aff2vec: Affect–enriched distributional word representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2204–2218. Association for Computational Linguistics.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations.
- Liu et al. (2017) Fei Liu, Julien Perez, and Scott Nowson. 2017. A Language-independent and Compositional Model for Personality Trait Recognition from Short Texts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 754–764.
- Mikolov et al. (2018) Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 52–55.
- Mikolov et al. (2013) Tomáš Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pages 3111–3119.
- Mohammad and Bravo-Marquez (2017a) Saif Mohammad and Felipe Bravo-Marquez. 2017a. Emotion Intensities in Tweets. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 65–77.
- Mohammad and Bravo-Marquez (2017b) Saif Mohammad and Felipe Bravo-Marquez. 2017b. WASSA-2017 Shared Task on Emotion Intensity. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 34–49.
- Mohammad and Kiritchenko (2015) Saif M Mohammad and Svetlana Kiritchenko. 2015. Using Hashtags to Capture Fine Emotion Categories from Tweets. Computational Intelligence, 31(2):301–326.
- Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- Peng et al. (2017) Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2037–2048.
- Pinheiro et al. (2017) Ana P. Pinheiro, Marcelo Dias, João Pedrosa, and Ana P. Soares. 2017. Minho Affective Sentences (MAS): Probing the roles of sex, mood, and empathy in affective ratings of verbal stimuli. Behavior Research Methods, 49(2):698–716.
- Rosenthal et al. (2017) Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. Semeval-2017 task 4: Sentiment analysis in twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation, pages 502–518.
- Rubin (2007) Victoria L. Rubin. 2007. Stating with certainty or stating with doubt: Intercoder reliability results for manual annotation of epistemically modalized statements. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 141–144.
- Sedoc et al. (2017) João Sedoc, Daniel Preoţiuc-Pietro, and Lyle Ungar. 2017. Predicting Emotional Word Ratings using Distributional Representations and Signed Clustering. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2, Short Papers), pages 564–571.
- Søgaard and Goldberg (2016) Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231–235.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958.
- Strapparava and Mihalcea (2007) Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 70–74.
- Tang et al. (2014) Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555–1565, Baltimore, Maryland. Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 6000–6010.
- Wang et al. (2016) Jin Wang, Liang-Chih Yu, K. Robert Lai, and Xuejie Zhang. 2016. Dimensional sentiment analysis using a regional CNN-LSTM model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 225–230.
- Yu et al. (2017) Liang-Chih Yu, Jin Wang, K. Robert Lai, and Xuejie Zhang. 2017. Refining word embeddings for sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 534–539. Association for Computational Linguistics.