Expanding Subjective Lexicons for Social Media Mining with Embedding Subspaces

12/31/2016 ∙ by Silvio Amir, et al. ∙ Google ∙ Inesc-ID

Recent approaches for sentiment lexicon induction have capitalized on pre-trained word embeddings that capture latent semantic properties. However, embeddings obtained by optimizing performance on a given task (e.g. predicting contextual words) are sub-optimal for other applications. In this paper, we address this problem by exploiting task-specific representations, induced via embedding sub-space projection. This allows us to expand lexicons describing multiple semantic properties. For each property, our model jointly learns suitable representations and the concomitant predictor. Experiments conducted over multiple subjective lexicons show that our model outperforms previous work and other baselines, even in low training data regimes. Furthermore, lexicon-based sentiment classifiers built on top of our lexicons outperform similar resources and yield performances comparable to those of supervised models.




1 Introduction

The rise of the social web brought about an unprecedented volume of data about human interactions, paving the way for a new, computational approach to the social sciences. Nowadays, scholars are able to study societal and behavioral dynamics through the analysis of large-scale social networks and vast amounts of social media. For example, natural language processing techniques have been applied to massive microblogging repositories to investigate a wide range of phenomena, such as population well-being [Mitchell et al.2013], political participation [Tumasjan et al.2010] and public opinion [O’Connor et al.2010].

One of the key resources supporting this kind of analysis is the subjective lexicon, i.e., a list of words with semantic annotations. Two types are particularly relevant: sentiment lexicons, which categorize words according to the polarity of the sentiment they convey; and emotion lexicons, which quantify the emotional states or responses evoked by a given word (e.g. joy or arousal). Typically, these resources are manually created by experts or via crowdsourcing campaigns, a process that can become expensive and time-consuming. As a result, manually crafted lexicons are necessarily incomplete and fail to capture the non-conventional word spellings, slang and new expressions commonly found in social media.

The automatic extraction of lexicons is a well-known and widely studied problem. Most proposed solutions are predicated on the idea that similar words should have similar labels. These solutions differ, essentially, along two axes. First, how the word similarities are captured, e.g. by leveraging knowledge bases such as WordNet [Hu and Liu2004] or word co-occurrence statistics derived from corpus analysis [Bestgen and Vincze2012]; and second, how the label assignment is operationalized, e.g. with supervised classifiers [Esuli and Sebastiani2006] or using graph-based label propagation algorithms [Rao and Ravichandran2009].

Recent work has also begun exploring neural word embeddings, due to their ability to capture word similarities and latent semantic properties. Tang et al. [Tang et al.2014] induced Twitter sentiment lexicons using sentiment-specific word embeddings, obtained via distant supervision, whereas Amir et al. [Amir et al.2015] proposed a predictive model, leveraging unsupervised embeddings as features. Using this approach, Amir et al. developed the top ranking submission of a lexicon expansion shared task organized by SemEval 2015 [Rosenthal et al.2015]. However, both these methods are inherently limited by their choice of word representations. On the one hand, the approach of Tang et al. uses embeddings tailored to capture sentiment information, so it can only be used for polarity lexicons. The method of Amir et al., on the other hand, can be used for other lexicon types, but it uses generic, unsupervised embeddings which are sub-optimal for specific downstream models [Astudillo et al.2015, Labutov and Lipson2013].

In this paper, we present an approach to overcome these limitations. Following Amir et al. [Amir et al.2015], we expand lexicons for social media mining with predictive models that use unsupervised word embeddings as features. This allows us to deal with lexicons describing different properties. Unlike their approach, however, we induce and exploit intermediate, task-specific representations via embedding subspace projection [Astudillo et al.2015]. The evaluation was conducted over seven lexicons describing 15 subjective properties. The results show that our models largely outperform the other baselines (with two exceptions). To assess the quality of our lexicons, we built and evaluated lexicon-based Twitter sentiment classifiers. We found that our lexicons: (i) outperform other similar resources; and (ii) yield performances comparable to those of supervised models.

2 Related Work

Previous work on automatic lexicon induction can be roughly divided into two classes: knowledge-based and corpora-based. These approaches assume that there is a small number of words for which the labels are known (sometimes referred to as the seed set). Knowledge-based approaches then use lexical databases such as WordNet to, e.g., exploit word relations such as antonymy/synonymy or hyponymy/hypernymy between new words and words in the seed set [Hu and Liu2004, Kim and Hovy2006, Rao and Ravichandran2009]. Others classify new words by leveraging distances to known words in the synset graph [Kamps et al.2004]. However, these methods are unsuited for the social web since they rely on formal lexical resources that do not encompass the informal language and writing style typical of this domain.

Corpora-based approaches implicitly exploit the distributional hypothesis, which works under the assumption that similar words tend to occur in similar contexts [Harris1954]. Based on this idea, word similarities can be computed from a term co-occurrence matrix, built from large corpora, using e.g. the point-wise mutual information (PMI) between terms [Turney and Littman2003, Kiritchenko, Zhu, and Mohammad2014] or with vector distance metrics over the space induced by Latent Semantic Analysis [Bestgen and Vincze2012, Yu et al.2013].
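As a minimal sketch of the PMI idea mentioned above, the following computes document-level PMI between word pairs. The function name `pmi_table`, the use of whole documents as co-occurrence windows and the toy corpus are assumptions made for illustration; the cited works use their own corpora and windowing schemes.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return {(a, b): PMI} for word pairs that co-occur in at least one document.

    Probabilities are estimated as document frequencies (an assumption of
    this sketch); pairs are stored with their words in alphabetical order.
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    pmi = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_docs
        p_a = word_counts[a] / n_docs
        p_b = word_counts[b] / n_docs
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Hypothetical toy corpus: "good"/"great" always co-occur, as do "awful"/"bad".
toy_corpus = [
    ["good", "great", "movie"],
    ["good", "great", "day"],
    ["bad", "awful", "movie"],
    ["bad", "awful", "day"],
]
pmi = pmi_table(toy_corpus)
```

In this toy corpus, words with the same polarity obtain a higher PMI than words that merely share a topic, which is the signal that corpora-based methods propagate from seed words to new words.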

In the last few years, efficient neural language models have been proposed to induce word embeddings (i.e. dense feature vectors) by learning to predict the surrounding contexts of words in large corpora (e.g. [Pennington, Socher, and Manning2014]). Recent work on lexicon expansion has also explored these representations, due to their ability to capture word similarities and latent semantic properties [Tang et al.2014, Amir et al.2015, Rothe, Ebert, and Schütze2016]. Nevertheless, because these vectors are estimated by minimizing the prediction errors made on generic, unsupervised tasks, they can be sub-optimal for specific downstream models. In this work, we take this observation into account and demonstrate that better models can be induced by jointly learning task-specific representations and predictors.

3 Learning Task-Specific Embeddings

Neural embeddings are distributed representations, i.e. each concept is described by multiple features (vector dimensions can be interpreted as abstract features) and each feature can be involved in describing multiple concepts [Hinton1986]. (In contrast, symbolic representations associate each feature to only one concept; e.g. the one-hot encoding represents word $w_i$ as a zero vector with the value 1 on the $i$-th dimension.) However, we hypothesize that not all features contribute equally to capturing any given aspect of the input. Thus, if we knew exactly which subset of features describes a given property (e.g. sentiment), we could extract a more compact representation containing only the meaningful information. This would eliminate noise and irrelevant aspects of the data. More importantly, predictors based on smaller representations require fewer free parameters, which makes them easier to train with small datasets without overfitting.

One common solution to extract compact representations from a large feature space is to perform dimensionality reduction by Principal Component Analysis (PCA). But this is a generic linear transformation that does not take into account the prediction targets, hence it is sub-optimal. (Moreover, word embeddings can be seen as the result of a dimensionality reduction over a sparse word-context co-occurrence matrix; see [Baroni, Dinu, and Kruszewski2014] and [Pennington, Socher, and Manning2014]. Thus it is not clear how to interpret the result of a subsequent dimensionality reduction.) Alternatively, we could use the full embeddings and fit a predictor with a sparsity-inducing objective, e.g. using $\ell_1$-norm regularization [Tibshirani1996]. This regularizer tries to bring the weights associated to some input dimensions down to zero, essentially eliminating the contribution of some features. However, without any prior knowledge about the structure of the embeddings, there is no guarantee that only the irrelevant dimensions would be eliminated. Moreover, given the distributed nature of word embeddings, simply ignoring arbitrary dimensions may degrade their expressiveness. Of course, we could just use vectors with fewer dimensions, but small embeddings have less capacity and tend to be better suited to syntactic tasks than to semantic ones [Ling et al.2015].

Word Embedding Sub-spaces

A simple, yet effective solution to this problem consists of estimating linear projections from generic embeddings to lower-dimensional sub-spaces [Astudillo et al.2015]. Given an embedding matrix $E \in \mathbb{R}^{e \times |V|}$, where the columns represent words from a vocabulary $V$ and $e$ is the embedding size, one can induce new representations with a factorization of the input as $S \cdot E$, where $S \in \mathbb{R}^{s \times e}$, with $s \ll e$, is a (learned) linear projection matrix. This is similar to PCA, but here the projection is estimated to directly optimize the prediction of the target labels. Therefore, the transformation corresponds to adapting generic, unsupervised representations into task-specific ones. The intuition is that by aggressively reducing the representation space, the model is forced to learn only the most discriminative aspects of the input with respect to the prediction targets. (This is also related to the idea of learning representations with information bottlenecks, e.g. with auto-encoders [Hinton and Salakhutdinov2006].)
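The shapes involved in the factorization can be made concrete with a short NumPy sketch. The sizes below and the random initialization of the projection are illustrative assumptions; in the model the projection matrix is learned, and the embedding matrix comes from a pre-trained model rather than a random generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: embedding size, vocabulary size, sub-space size.
e, vocab_size, s = 600, 1000, 10

E = rng.normal(size=(e, vocab_size))      # generic embeddings, one column per word (kept fixed)
S = rng.normal(scale=0.01, size=(s, e))   # projection matrix (random here, learned in practice)

projected = S @ E                         # task-specific representations, shape (s, vocab_size)

# Projecting a single word is the same operation applied to one column.
word_index = 42
word_vector = S @ E[:, word_index]
```

Because the projection has only `s * e` parameters and the embeddings stay fixed, the number of trainable parameters does not grow with the vocabulary size.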

4 Proposed Approach

As we discussed above, lexicons can be expanded under the assumption that similar words should have similar labels. To operationalize this assumption, we capitalize on two fundamental properties of neural word embeddings. First, they encode functional (i.e., semantic and syntactic) similarities in terms of geometric locality, which allows us to predict consistent labels for inflections, spelling variations and synonyms of a word. Second, they capture latent word aspects, some of which correspond to subjective properties, e.g. sentiment polarity.

Our approach to lexicon expansion consists of training models to predict the labels of pre-existing lexicons, leveraging unsupervised word embeddings as features. We further assume that different aspects captured by these embeddings are encoded in some (unknown) subset of features. Therefore, we adopted the Non-Linear Subspace Embedding (NLSE) model of Astudillo et al. [Astudillo et al.2015] to jointly learn embedding sub-space projections that better represent specific word properties, and the concomitant predictors.

The NLSE is essentially a feed-forward neural network with a single hidden layer and a factorization of the input layer. Denoting a lexicon of $N$ word/label pairs as $D = \{(w_1, y_1), \ldots, (w_N, y_N)\}$, where $y$ is a categorical random variable over a set of classes $\mathcal{Y}$, the model estimates the probability of each possible category $k \in \mathcal{Y}$ given a word $w_i$ as

$$P(y = k \mid w_i) \propto \exp\left(C_k \cdot h_i\right), \qquad h_i = \sigma\left(S \cdot E_i\right) \tag{1}$$

Here, $E_i$ selects the $i$-th column of the embedding matrix $E$ (corresponding to word $w_i$), $h_i$ is a vector of activations computed by the hidden layer and $\sigma$ denotes an element-wise sigmoid non-linearity. The matrix $C$, whose $k$-th row is $C_k$, maps the embedding sub-space to the classification space. The parameters $S$ and $C$ are estimated to minimize the negative log-likelihood of the training data

$$\mathcal{L}(D) = -\sum_{(w_i, y_i) \in D} \log P(y = y_i \mid w_i) \tag{2}$$

The original embedding matrix is kept fixed, while the projection parameters are estimated jointly with the predictor. Thus, the model induces and exploits a compact, task-specific representation that preserves the rich information captured by the embeddings.
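The forward pass and training objective of the categorical model can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `E` (fixed embeddings), `S` (sub-space projection) and `C` (classification matrix), the toy sizes and the random parameter values are all chosen here for the example, and no bias term is included.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def nlse_probabilities(E, S, C, word_ids):
    """Class probabilities per word: softmax over C . sigmoid(S . E_i)."""
    H = sigmoid(S @ E[:, word_ids])                 # hidden activations, shape (s, n)
    scores = C @ H                                  # shape (n_classes, n)
    scores = scores - scores.max(axis=0, keepdims=True)  # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=0, keepdims=True)


def negative_log_likelihood(probs, labels):
    """Sum of -log P(correct class) over the given examples."""
    n = labels.shape[0]
    return float(-np.log(probs[labels, np.arange(n)]).sum())


rng = np.random.default_rng(0)
e, vocab_size, s, n_classes = 50, 200, 5, 3         # assumed toy sizes
E = rng.normal(size=(e, vocab_size))                # stands in for pre-trained embeddings
S = rng.normal(scale=0.1, size=(s, e))
C = rng.normal(scale=0.1, size=(n_classes, s))

word_ids = np.array([3, 17, 42, 99])                # hypothetical lexicon entries
labels = np.array([0, 2, 1, 1])                     # hypothetical class indices
probs = nlse_probabilities(E, S, C, word_ids)
loss = negative_log_likelihood(probs, labels)
```

During training only `S` and `C` would be updated by gradient descent on `loss`, while `E` stays fixed, which is what allows the model to label any word that has an embedding.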

Note that the model in Eq. 1 produces a probability distribution over the output classes, hence it is only suitable for categorical lexicons. Nonetheless, it can be easily adapted to continuous outputs by replacing the softmax classifier with a simple linear regressor:

$$\hat{y}_i = v \cdot h_i + b \tag{3}$$

where $h_i = \sigma(S \cdot E_i)$ is the hidden-layer activation vector for word $w_i$ (with $E_i$ the embedding of $w_i$ and $S$ the projection matrix, as in Eq. 1), and $v$ and $b$ are the regression weights and bias, respectively. The parameters are estimated by minimizing the Mean Squared Error over the training data

$$\mathcal{L}(D) = \frac{1}{N} \sum_{(w_i, y_i) \in D} \left(y_i - \hat{y}_i\right)^2 \tag{4}$$

The loss functions in Eq. 2 and Eq. 4 can be minimized with standard gradient-based optimization methods. After training, the models can be employed to extrapolate the labels of any word for which an embedding is available. Furthermore, by using embeddings induced from social media, we can adapt any pre-existing lexicon to include terms that are used in this domain.
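To illustrate how the regression variant could be trained with plain gradient descent, the sketch below fits a sub-space projection and a linear regressor on synthetic data. Everything specific here is an assumption made for the example: the symbol names (`S`, `v`, `b`), the sizes, the learning rate, the number of steps and the synthetic labels. The paper does not specify its optimizer beyond "standard gradient-based optimization".

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
e, n_words, s = 20, 300, 5                        # assumed toy sizes
E = rng.normal(size=(e, n_words))                 # fixed embeddings, one column per word
u = rng.normal(size=e)
y = 0.5 + np.tanh(u @ E / np.sqrt(e))             # synthetic real-valued labels

S = rng.normal(scale=0.1, size=(s, e))            # sub-space projection (trained)
v = np.zeros(s)                                   # regression weights (trained)
b = 0.0                                           # regression bias (trained)
learning_rate = 0.05
losses = []

for step in range(500):
    H = sigmoid(S @ E)                            # hidden activations, shape (s, n_words)
    pred = v @ H + b                              # predicted scores, shape (n_words,)
    err = pred - y
    losses.append(float(np.mean(err ** 2)))       # mean squared error

    # Backpropagate the MSE through the regressor and the sigmoid layer.
    d_pred = 2.0 * err / n_words
    grad_v = H @ d_pred
    grad_b = d_pred.sum()
    grad_S = (np.outer(v, d_pred) * H * (1.0 - H)) @ E.T

    v -= learning_rate * grad_v
    b -= learning_rate * grad_b
    S -= learning_rate * grad_S                   # E itself is never updated
```

After training, scoring an unseen word only requires its embedding column: `v @ sigmoid(S @ embedding) + b`.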

5 Predicting Lexicon Labels

In this section, we evaluate our method on the task of inferring different subjective word properties, namely sentiment polarity, happiness level, affective responses (specifically, valence, arousal and dominance) and emotion association (concretely, Plutchik's basic emotion set [Plutchik1980]). We trained the models described in the previous section to predict the labels assigned by human judges to several well-known subjective lexicons. These are summarized in Table 1, and consist of two groups: categorical lexicons, associating words to specific classes, e.g., positive polarity (top rows); and real-valued lexicons, assigning continuous values to words, e.g., valence (bottom rows).

| Lexicon | # Words |
|---|---|
| Opinion Mining Lexicon (OML) [Hu and Liu2004] | 6,787 |
| MPQA [Wilson, Wiebe, and Hoffmann2005] | 6,886 |
| Emotion Lexicon (EmoLex) [Mohammad and Turney2013] | 14,174 |
| ANEW [Bradley and Lang1999] | 1,040 |
| SemEval (Sem-Lex) [Rosenthal et al.2015] | 1,515 |
| LabMT [Dodds et al.2011] | 10,000 |
| Ext-ANEW [Warriner, Kuperman, and Brysbaert2013] | 13,915 |

Table 1: Lexicons

Experimental Setup

Our approach requires an unlabeled corpus to support the induction of the word embedding matrix. Following Amir et al. [Amir et al.2015], we induced 600-dimensional Structured Skip-gram word embeddings [Ling et al.2015]. We evaluated several baselines utilizing the same word embeddings as input, but with predictors based on three variants of Support Vector Machines (SVM) and Support Vector Regression (SVR) [Vapnik2000]: (i) linear; (ii) with $\ell_1$-norm regularization; and (iii) with a non-linear kernel. For the latter, we used a radial basis function (RBF) kernel of the form $K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$ with $\gamma > 0$, where $x$ denotes a feature vector. In this case, the models learn a linear function in the space induced by the kernel and the data, which corresponds to a non-linear function in the original space. This baseline corresponds to the model of Amir et al. [Amir et al.2015]. Finally, we considered the linear models but using compact representations obtained with PCA.
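For reference, the standard Gaussian RBF kernel, $\exp(-\gamma \lVert x - x' \rVert^2)$, can be computed for a batch of feature vectors as in the sketch below. The function name `rbf_kernel`, the row-wise data layout and the value of `gamma` are assumptions of this example; the kernel width actually used in the experiments was selected by grid search.

```python
import numpy as np


def rbf_kernel(X, Z, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2) for row-wise feature vectors."""
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]
        + (Z ** 2).sum(axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


# Three hypothetical 2-dimensional feature vectors.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X, gamma=0.5)
```

Larger values of `gamma` make the kernel more local, so each training word influences only its nearest neighbours in embedding space.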

The experiments were performed by splitting the labeled data (i.e., the lexicons) in 80% for model training and the remaining 20% for evaluation. Then, for each experiment, 20% of the training data was reserved for hyper-parameter tuning via grid-search. We tuned the misclassification cost in the range for all the SVM and SVR models. Furthermore, we searched over the following, model specific, hyper-parameters: kernel widths, in the range for the RBF kernel baselines; regularization constant in the range for the regularized baselines; and the number of components to keep in the PCA baselines, in the range . Regarding the NLSE model, we optimized the subspace size and the learning rate over the ranges and , respectively.
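The data partitioning described above (80% training and 20% evaluation, with 20% of the training portion held out for tuning) can be sketched as follows. The function name `split_lexicon`, the fixed seed and the synthetic lexicon are illustrative assumptions; the source does not say how the splits were randomized.

```python
import random


def split_lexicon(entries, seed=0):
    """Split entries into train/dev/test: 20% test, then 20% of the rest as dev."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(0.2 * len(shuffled)))
    test, train_full = shuffled[:n_test], shuffled[n_test:]
    n_dev = int(round(0.2 * len(train_full)))
    dev, train = train_full[:n_dev], train_full[n_dev:]
    return train, dev, test


# Hypothetical lexicon of (word, score) pairs.
lexicon = [(f"word{i}", i / 1000.0) for i in range(1000)]
train, dev, test = split_lexicon(lexicon)
```

With 1,000 entries this yields 640 training, 160 development and 200 test items; hyper-parameters would be chosen on the development portion only.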

Figure 1: t-SNE projection, to two dimensions, of the embeddings associated to words from Sem-Lex. The points are colored according to their sentiment polarity. The left plot shows the words represented as 600-dimensional unsupervised embeddings. The right plot shows the same words represented with task-specific embeddings induced with the NLSE model.


The main experimental results are presented in Table 2, for the categorical lexicons, and Table 3, for the continuous ones. We can see that the NLSE largely outperforms all the other baselines, apart from two exceptions where the approach of Amir et al. [Amir et al.2015] (RBF column) does slightly better. The results also show that the support vector models tend to perform better when used with non-linear kernels.

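| Lexicon | Property | NLSE | linear | $\ell_1$ | RBF | PCA |
|---|---|---|---|---|---|---|
| OML | sentiment | 0.882 | 0.868 | 0.686 | 0.872 | 0.852 |
| MPQA | sentiment | 0.691 | 0.691 | 0.221 | 0.669 | 0.555 |
| | subjectivity | 0.825 | 0.819 | 0.798 | 0.833 | 0.805 |
| EmoLex | sentiment | 0.676 | 0.630 | 0.404 | 0.640 | 0.468 |
| | sadness | 0.509 | 0.340 | 0.167 | 0.334 | 0.000 |
| | fear | 0.503 | 0.373 | 0.261 | 0.394 | 0.000 |
| | anger | 0.468 | 0.353 | 0.214 | 0.366 | 0.000 |
| | disgust | 0.446 | 0.343 | 0.180 | 0.352 | 0.000 |
| | joy | 0.440 | 0.333 | 0.148 | 0.329 | 0.000 |
| | trust | 0.403 | 0.201 | 0.190 | 0.167 | 0.000 |
| | surprise | 0.204 | 0.167 | 0.093 | 0.119 | 0.000 |
| | anticipation | 0.240 | 0.108 | 0.151 | 0.044 | 0.000 |
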
Table 2: Results for categorical lexicons in terms of Avg.

Regarding the baselines that try to uncover the relevant information from the embeddings (i.e., PCA and $\ell_1$), we can see that they perform very poorly. This was expected, since the former induces a low-rank approximation of the original embedding that (tries to) preserve most of the variance. However, since word embeddings are distributed representations, the values of individual dimensions are meaningless and should be regarded as coordinates in a high-dimensional space. The latter, on the other hand, tries to drop some of the input dimensions, but in doing so degrades the information contained in the word representations. These approaches are particularly ineffective for the more nuanced properties, such as fine-grained emotions. Conversely, these are precisely the cases where our approach stands out, which underlines the benefits of inducing task-specific representations.

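| Lexicon | Property | NLSE | linear | $\ell_1$ | RBF | PCA |
|---|---|---|---|---|---|---|
| Sem-Lex | sentiment | 0.667 | 0.610 | 0.619 | 0.630 | 0.622 |
| LabMT | happiness | 0.640 | 0.576 | 0.573 | 0.622 | 0.464 |
| ANEW | arousal | 0.440 | 0.365 | 0.375 | 0.415 | 0.389 |
| | valence | 0.683 | 0.612 | 0.604 | 0.646 | 0.592 |
| | dominance | 0.546 | 0.477 | 0.456 | 0.494 | 0.475 |
| Ext-ANEW | arousal | 0.393 | 0.373 | 0.371 | 0.397 | 0.315 |
| | valence | 0.607 | 0.567 | 0.565 | 0.593 | 0.494 |
| | dominance | 0.480 | 0.445 | 0.443 | 0.464 | 0.405 |
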
Table 3: Results for continuous lexicons in terms of Kendall rank correlation
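For readers unfamiliar with the metric, Kendall rank correlation compares the ordering of predicted scores against the ordering of gold scores. The sketch below implements the simplest variant (often called tau-a); which variant and tie handling were used for Table 3 is not stated in the text, so this choice, the function name and the toy scores are assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical gold and predicted scores for four words.
gold = [0.9, 0.1, 0.5, 0.7]
pred = [0.8, 0.2, 0.6, 0.4]
tau = kendall_tau(gold, pred)
```

Here five of the six word pairs are ordered consistently and one is swapped, giving a correlation of 4/6. Because only the ordering matters, the metric is insensitive to the scale of the predicted scores.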

To further illustrate the latter point, we wanted to visualize the effect of the sub-space projection on the word representation space. Therefore, we used the t-SNE algorithm [Maaten and Hinton2008] to project the embeddings into two dimensions and plotted the words from Sem-Lex, colored according to their sentiment score. We first used the unsupervised embeddings and then, leveraging a sub-space projection (trained on Sem-Lex), induced and plotted new embeddings for the same words. These two plots are shown in Figure 1. On the left, we can see that unsupervised embeddings can naturally capture sentiment information: words with similar sentiment scores tend to be closer to each other. On the right, we see that in the space induced by the sub-space projection, not only are similar words (with respect to sentiment) drawn even closer but also, quite interestingly, the words become arranged in what seems to be a continuum from the most negative to the most positive sentiment polarity.

Figure 2: Performance of the different baselines in predicting the happiness score of words, as a function of the size of the training data.

The overall results show that pre-trained word embeddings can indeed capture a wide range of semantic properties, and be leveraged to induce subjective lexicons. Furthermore, the simplicity of our method suggests that it could be used to derive specific lexicons for different domains or demographics, to reflect the fact that some words are used with different connotations by different groups of people [Yang and Eisenstein2015]. However, this would require creating multiple ‘training’ lexicons, which raises the question of how much data is required to induce high-quality lexicons. To investigate this question, we plotted the performance of the different models as a function of the training data size (Figure 2). As expected, the performance of all the models monotonically decreases with less training data. Nevertheless, we observe that the performance of our model decays more slowly than that of the RBF baseline (the second best method). Notably, when trained with 30% of the data, our model attains the same performance as the RBF baseline trained with 70% of the data.

6 Lexicon Based Twitter Sentiment Analysis

Figure 3: Results of the sentiment classification experiments. First panel: performance of lexicon-based classifiers built on top of different lexicons. Second panel: comparison of lexicon-based classifiers against supervised models.

We now report on a set of experiments designed to assess the quality of our lexicons in downstream applications. To this end, we induced a large-scale sentiment lexicon, henceforth denoted as NLSE-Lex, with a model trained on the Sem-Lex lexicon. Then, we developed lexicon-based sentiment classifiers that infer the polarity of messages by aggregating the sentiment scores of individual words. More formally, given a message with words, the overall sentiment is:


where, is the sentiment score associated to word in a given lexicon and is the threshold that separates the positive and negative classes.
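A lexicon-based classifier of this kind can be sketched in a few lines. Several choices below are assumptions of this example: averaging as the aggregation function, skipping words that are absent from the lexicon, a score of 0.0 for messages with no known words, and the toy lexicon and threshold.

```python
def message_score(tokens, lexicon):
    """Average lexicon score of the known words in a message (0.0 if none are known)."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def classify(tokens, lexicon, threshold):
    """Label a message positive if its score exceeds the threshold, negative otherwise."""
    return "positive" if message_score(tokens, lexicon) > threshold else "negative"


# Hypothetical lexicon with scores in [-1, 1].
toy_lexicon = {"love": 0.9, "great": 0.8, "hate": -0.9, "awful": -0.7}

label_a = classify("i love this great phone".split(), toy_lexicon, threshold=0.0)
label_b = classify("awful service , i hate it".split(), toy_lexicon, threshold=0.0)
```

No labeled messages are needed to build such a classifier; only the lexicon and a threshold have to be chosen.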

We compared the performance of lexicon-based classifiers built on top of the following Twitter lexicons:

  • Sem-Lex, a small manually labeled lexicon. This will provide a baseline performance;

  • NLSE-Lex, induced with our method;

  • Sentiment140 (S140) and Hashtag Sentiment (HL), created using term co-occurrence statistics collected from large corpora [Kiritchenko, Zhu, and Mohammad2014];

  • SSEmb-Lex, obtained using sentiment-specific embeddings [Tang et al.2014].

For simplicity, we converted the labels of all the lexicons to a common range and discarded words whose scores fall within a band of near-neutral values, to keep only terms that strongly convey sentiment, as suggested by Dodds et al. [Dodds et al.2011]. The classification threshold was set with a simple heuristic: we assume that most Twitter posts do not convey any particular sentiment, thus we set the threshold to the expected sentiment score. In other words, if the score of a message is above the expected sentiment score, then it is considered positive; otherwise it is negative. Finally, we compared the performance of the lexicon-based classifier to that of supervised approaches. For each test set, we trained two SVM models: one using only bag-of-words features (SVM-BOW); and another combining BOW features with a set of features extracted from the manually created lexicon, namely the mean, sum, maximum, minimum and standard deviation of the word sentiment scores (SVM-BOW + Lexicon).
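The five lexicon-derived features can be computed per message as in the following sketch. The function name, the zero vector returned for messages with no lexicon words and the use of the population standard deviation are assumptions made here; the text does not specify these details.

```python
import statistics


def lexicon_features(tokens, lexicon):
    """Return [mean, sum, max, min, std] of the lexicon scores of the known words."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    if not scores:
        return [0.0] * 5
    return [
        statistics.fmean(scores),
        sum(scores),
        max(scores),
        min(scores),
        statistics.pstdev(scores),   # population standard deviation (assumed)
    ]


# Hypothetical lexicon and message.
toy_lexicon = {"love": 1.0, "good": 0.5, "bad": -0.5}
features = lexicon_features("love the good not the bad".split(), toy_lexicon)
```

These values would be appended to the bag-of-words vector of each message before training the SVM.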

| Dataset | # Training Tweets | # Test Tweets |
|---|---|---|
| TW-train | 6,013 | - |
| TW13 | - | 2,173 |
| TW14 | - | 1,183 |
| TW15 | - | 1,402 |
| OMD | 1,306 | 598 |
| HCR | 1,257 | 665 |

Table 4: Summary of the datasets used in the sentiment classification experiments. The top rows correspond to the test sets from SemEval’s Twitter Sentiment Analysis competition; the TW-train dataset was only used as training data. The bottom rows correspond to the datasets introduced by Speriosu et al. [Speriosu et al.2011].

The classifiers were evaluated on five datasets, summarized in Table 4: three datasets compiled by SemEval for their well-known Twitter sentiment analysis challenge [Rosenthal et al.2015] (TW13, TW14 and TW15; top rows), and two datasets introduced by Speriosu et al. [Speriosu et al.2011]. The latter are OMD, with reactions to the 2008 USA presidential debate between the Democratic candidate Barack Obama and the Republican candidate John McCain; and HCR, with tweets discussing the 2010 health care reform in the USA. It should be noted that the datasets of Speriosu et al. have standard splits for training, development and testing; hence, for ease of comparison, our classifiers were evaluated on the test sets. Furthermore, all the aforementioned datasets are labeled in terms of three classes (positive, negative and neutral), but in these experiments we excluded the neutral class and focused on binary classification.


The results of the sentiment classification experiments are presented in Figure 3. In the first panel, we compare the performance of the different lexicons over the test data. We observe that our lexicon outperforms the others in nearly all cases, with the exception of the HCR dataset, where the HL lexicon performs marginally better. However, note that this same lexicon obtains the worst performance on TW14. In the second panel, we compare NLSE-Lex with the supervised models. We found that our lexicon-based classifier is extremely competitive and, somewhat surprisingly, even outperforms the supervised baselines in almost all of the datasets.

7 Conclusions

This paper presented a novel approach to induce large-scale subjective lexicons suitable for social media analysis. We exploit the fact that unsupervised word embeddings capture semantic properties of words, and can be used as features for lexicon expansion models. However, instead of using the embeddings directly, we induce and exploit task-specific representations via sub-space projection. To this end, we leverage the NLSE model of Astudillo et al. [Astudillo et al.2015] to jointly learn the adapted representations and the respective predictor. The experimental results show that our approach outperforms previous work and other related baselines, across multiple lexicons and subjective properties. Working with lower-dimensional representations also allows us to induce predictors with less training data. Indeed, the results demonstrate that, compared to the other baselines, our method can make better use of limited amounts of training data. Finally, we empirically showed how the sub-space projections learned by the NLSE transform the embedding space to better capture task-specific information.

To assess the quality of our lexicons, first, we compared the performance of lexicon-based sentiment classifiers built on top of ours, and other large-scale Twitter lexicons. We observed that the classifiers built with our lexicons largely outperform the other baselines. Second, we compared our lexicon-based classifier with supervised models and, surprisingly, we found that our lexicon-based model outperforms the more sophisticated models. These results demonstrate the quality of our lexicon and suggest that, with the appropriate lexicons, simple studies (e.g., involving binary sentiment classification) can be performed without the hassle of creating labeled data.


  • [Amir et al.2015] Amir, S.; Ling, W.; Astudillo, R.; Martins, B.; Silva, M. J.; and Trancoso, I. 2015. INESC-ID: A regression model for large scale twitter sentiment lexicon induction. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 613–618.
  • [Astudillo et al.2015] Astudillo, R.; Amir, S.; Ling, W.; Silva, M.; and Trancoso, I. 2015. Learning word representations from scarce and noisy data with embedding subspaces. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1074–1084.
  • [Baroni, Dinu, and Kruszewski2014] Baroni, M.; Dinu, G.; and Kruszewski, G. 2014. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • [Bestgen and Vincze2012] Bestgen, Y., and Vincze, N. 2012. Checking and bootstrapping lexical norms by means of word similarity indexes. Behavior Research Methods 44(4).
  • [Bradley and Lang1999] Bradley, M. M., and Lang, P. J. 1999. Affective norms for english words (anew): Instruction manual and affective ratings. Technical report, Citeseer.
  • [Dodds et al.2011] Dodds, P. S.; Harris, K. D.; Kloumann, I. M.; Bliss, C. A.; and Danforth, C. M. 2011. Temporal patterns of happiness and information in a global social network: Hedonometrics and twitter. PloS one 6(12):e26752.
  • [Esuli and Sebastiani2006] Esuli, A., and Sebastiani, F. 2006. Sentiwordnet: A publicly available lexical resource for opinion mining. In Proceedings of LREC, volume 6, 417–422. Citeseer.
  • [Harris1954] Harris, Z. S. 1954. Distributional structure. Word 10(2-3):146–162.
  • [Hinton and Salakhutdinov2006] Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786):504–507.
  • [Hinton1986] Hinton, G. E. 1986. Learning distributed representations of concepts. In Proceedings of the eighth annual conference of the cognitive science society, volume 1,  12. Amherst, MA.
  • [Hu and Liu2004] Hu, M., and Liu, B. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge discovery and data mining, 168–177.
  • [Kamps et al.2004] Kamps, J.; Marx, M.; Mokken, R. J.; and de Rijke, M. 2004. Using wordnet to measure semantic orientations of adjectives. In Proceedings of 4th International Conference on Language Resources and Evaluation, Vol IV, 1115–1118.
  • [Kim and Hovy2006] Kim, S.-M., and Hovy, E. 2006. Identifying and analyzing judgment opinions. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, 200–207.
  • [Kiritchenko, Zhu, and Mohammad2014] Kiritchenko, S.; Zhu, X.; and Mohammad, S. M. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research.

  • [Labutov and Lipson2013] Labutov, I., and Lipson, H. 2013. Re-embedding words. In Proceedings of the 51st annual meeting of the ACL, 489–493.
  • [Ling et al.2015] Ling, W.; Dyer, C.; Black, A. W.; Trancoso, I.; Fermandez, R.; Amir, S.; Marujo, L.; and Luis, T. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1520–1530.
  • [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. Journal of Machine Learning Research.

  • [Mitchell et al.2013] Mitchell, L.; Harris, K. D.; Frank, M. R.; Dodds, P. S.; and Danforth, C. M. 2013. The geography of happiness: connecting twitter sentiment and expression, demographics, and objective characteristics of place. PLoS ONE 8(5).
  • [Mohammad and Turney2013] Mohammad, S. M., and Turney, P. D. 2013. Crowdsourcing a word–emotion association lexicon. Computational Intelligence 29(3).
  • [O’Connor et al.2010] O’Connor, B.; Balasubramanyan, R.; Routledge, B. R.; and Smith, N. A. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. Proceedings of the 2014 Empirical Methods in Natural Language Processing 12.
  • [Plutchik1980] Plutchik, R. 1980. A general psychoevolutionary theory of emotion. Academic press. 3–33.
  • [Rao and Ravichandran2009] Rao, D., and Ravichandran, D. 2009. Semi-supervised polarity lexicon induction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 675–682.
  • [Rosenthal et al.2015] Rosenthal, S.; Nakov, P.; Kiritchenko, S.; Mohammad, S. M.; Ritter, A.; and Stoyanov, V. 2015. Semeval-2015 task 10: Sentiment analysis in twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval ’2015. Denver, Colorado: Association for Computational Linguistics.
  • [Rothe, Ebert, and Schütze2016] Rothe, S.; Ebert, S.; and Schütze, H. 2016. Ultradense word embeddings by orthogonal transformation. arXiv preprint arXiv:1602.07572.
  • [Speriosu et al.2011] Speriosu, M.; Sudan, N.; Upadhyay, S.; and Baldridge, J. 2011. Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the First workshop on Unsupervised Learning in NLP, 53–63. Association for Computational Linguistics.
  • [Tang et al.2014] Tang, D.; Wei, F.; Qin, B.; Zhou, M.; and Liu, T. 2014. Building large-scale twitter-specific sentiment lexicon : A representation learning approach. In Proceedings of the 25th International Conference on Computational Linguistics, 172–182.
  • [Tibshirani1996] Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 267–288.
  • [Tumasjan et al.2010] Tumasjan, A.; Sprenger, T. O.; Sandner, P. G.; and Welpe, I. M. 2010. Predicting elections with twitter: What 140 characters reveal about political sentiment. In 4th International AAAI Conference on Weblogs and Social Media.
  • [Turney and Littman2003] Turney, P. D., and Littman, M. L. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems 21(4).
  • [Vapnik2000] Vapnik, V. 2000. The nature of statistical learning theory. Springer Science & Business Media.
  • [Warriner, Kuperman, and Brysbaert2013] Warriner, A. B.; Kuperman, V.; and Brysbaert, M. 2013. Norms of valence, arousal, and dominance for 13,915 english lemmas. Behavior research methods 45(4):1191–1207.
  • [Wilson, Wiebe, and Hoffmann2005] Wilson, T.; Wiebe, J.; and Hoffmann, P. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on human language technology and empirical methods in natural language processing, 347–354. Association for Computational Linguistics.
  • [Yang and Eisenstein2015] Yang, Y., and Eisenstein, J. 2015. Putting things in context: Community-specific embedding projections for sentiment analysis. arXiv preprint arXiv:1511.06052.
  • [Yu et al.2013] Yu, L.-C.; Wu, J.-L.; Chang, P.-C.; and Chu, H.-S. 2013. Using a contextual entropy model to expand emotion words and their intensity for the sentiment classification of stock market news. Knowledge-Based Systems 41(0).