
Contextual and Position-Aware Factorization Machines for Sentiment Classification

While existing machine learning models have achieved great success in sentiment classification, they typically do not explicitly capture sentiment-oriented word interactions, which can lead to poor results in fine-grained analysis at the snippet level (a phrase or sentence). Factorization Machines provide a possible approach to learning element-wise interactions for recommender systems, but they are not directly applicable to our task due to their inability to model contexts and word sequences. In this work, we develop two Position-aware Factorization Machines which consider word interaction, context and position information. Such information is jointly encoded in a set of sentiment-oriented word interaction (SWI) vectors. Compared to traditional word embeddings, SWI vectors explicitly capture sentiment-oriented word interactions and simplify parameter learning. Experimental results show that while they have comparable performance with state-of-the-art methods for document-level classification, they benefit snippet/sentence-level sentiment analysis.





1 Introduction

Although machine learning-based methods have achieved great success in sentiment classification Liu (2012), they have limitations in explicitly capturing or presenting sentiment-oriented word interactions. Here a sentiment-oriented (SO) word interaction means that when two (or more) individual words function together as a sentiment expression (in a review snippet), the effect of their word-wise interaction determines the sentiment orientation of that snippet. For example, in the online review snippet “this button is hard to push”, besides the sentiment polarity of every single word, the word interaction between “hard” and “push” indicates a negative signal for this snippet. Although some of the existing models consider such SO word interaction, it is coarsely modeled or implicitly captured, which we will detail in Section 2.

The lack of SO word interaction may not be critical for coarse sentiment classification at the document level (a full review), but such SO interaction can play a crucial role in finer-grained analysis at the snippet level (a phrase or short sentence). The reasons are: (1) while a long document contains rich content, only limited text information is available in a short snippet, and (2) some salient opinion words (e.g., “good” and “amazing”) can dominate document classification, but these words may not always appear in short expressions.

Specifically, the sentiment expression in a review snippet may consist of multiple words. Let us take a further look at the aforementioned example “the button is hard to push”. We can see that “hard” and “push” are used to deliver a negative opinion from a customer. However, “hard” or “push” independently indicates no clear sentiment. Notice that there are also other snippets like “this is a hard (cellphone) case”, where “hard” and “case” together assign a positive sentiment. In the above examples, an individual word like “hard”, “case” or “push” is not able to determine the whole sentiment polarity of a snippet. Instead, the word interactions play the more important role in identifying sentiment in such snippets: e.g., “hard” and “push” together indicate a negative opinion, while “hard” and “case” interactively specify a positive opinion.

This paper proposes a solution that captures such interaction explicitly by exploiting the Factorization Machine Rendle (2010). The Factorization Machine (FM), which is widely used in recommender systems, is a general approach that can break the independence of interaction variables. It suggests a possible way to realize our goal, i.e., to capture the SO interaction. However, a direct application of FM is not suitable for two main reasons. First, FM aims at learning the global interaction between all elements and neglects the importance of modeling (local) context in text; different from recommender systems, contextual information plays a key part in sentiment analysis. Second, while the position/ordering of different features/fields in recommender systems may not matter, the position information of words is an indicative signal in text data.

To address them, we first propose the Contextual Factorization Machine (CFM), which models context by capturing the focused interactions for a specific sentiment expression. After that, we propose the Position-aware Factorization Machine (PFM) to further encode position information. In these two models, the word interaction, context and position information are jointly learned by a set of vectors termed sentiment-oriented Word Interaction (SWI) vectors. Compared to the word embeddings widely used in neural models for sentiment classification, SWI vectors explicitly capture SO word interaction and simplify the parameter learning. Experimental results show that while they give comparable performance with state-of-the-art methods for document-level classification, they effectively benefit snippet-level analysis.

This paper makes the following contributions:

  1. It proposes a new solution to explicitly model sentiment-oriented word interaction for fine-grained sentiment analysis.

  2. It proposes two new models called CFM and PFM to learn a set of Sentiment-sensitive Word Interaction (SWI) vectors. Such vectors jointly capture word interaction, context and position information and also simplify the parameter learning.

  3. Comprehensive experiments are conducted on three real-world review datasets at the document and snippet/sentence level. By comparison with state-of-the-art models, the experimental results show the effectiveness of our approaches.

2 Sentiment-Oriented Word Interaction

In this section, we review some state-of-the-art machine learning methods for sentiment classification and analyze their limitations in modeling sentiment-oriented (SO) word interaction. Although some of them consider SO word interaction, it is coarsely modeled or implicitly captured. Based on their style of word representation, these methods are generally grouped into Bag-of-Words (BoW) based and Word Embedding (WE) based methods. We illustrate them as follows; related notations are shown in Table 1.

2.1 Bag-of-Words (BoW) based Methods

In the Bag-of-Words model, words are indexed and text documents are converted to vectors. The values in such vectors can be word occurrences, word counts or TF-IDF. In this case, sentiment information is learned by the corresponding model parameters. Specifically, in linear models like Logistic Regression (LR) in Equation 1 or non-linear models like SVM with feature projections (or kernels) in Equation 2, we can see that the weight vector w is the parameter capturing sentiment information under supervised learning:

\hat{y} = \sigma(\mathbf{w}^{\top}\mathbf{x}) \quad (1)

\hat{y} = \mathbf{w}^{\top}\phi(\mathbf{x}) \quad (2)

For simplicity of illustration, bias terms are excluded here but they will still be used in our experiments.


When BoW is used in linear models like LR, the sentiment information captured by w is interpretable, but we cannot measure the direct interaction between words. For example, a learned parameter of the word “good” (i.e., w_good) can make a text snippet containing “good” more likely to be predicted as positive (when w_good > 0), which is straightforward. However, it has problems when predicting sentiment for snippets containing “hard” and “push” or “hard” and “case”. Ideally, their word-wise interaction should be considered, but here the sentiment polarity is determined by w_hard, w_push and w_case, which are independent variables. On the other hand, when BoW is used with a non-linear projection like SVM with non-linear kernels, it may help solve the problem, but it is harder to track the SO word interaction due to the non-linear feature projections.

SVM with Polynomial Projection (SVM-Poly)
Using BoW with a d-degree polynomial feature projection (or d-poly kernel) is an exception, for example, the 2-poly kernel. Its feature mapping and prediction function are shown in Equations 3 and 4. Note that the bias and linear terms are excluded for simplicity in Equation 4.

\phi(\mathbf{x}) = (x_1 x_2,\ x_1 x_3,\ \ldots,\ x_{n-1} x_n) \quad (3)

\hat{y}(\mathbf{x}) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij}\, x_i x_j \quad (4)

One can see this approach is capable of capturing word interactions; for instance, the direct interaction of “hard” and “case” can be parameterized as w_{hard,case}. However, this is still problematic, as all such interaction parameters are independent. That is, we can learn w_{hard,push} and w_{hard,case}, but they are two isolated parameters, regardless of the fact that they share the word “hard”. In addition, this approach suffers from data sparsity, and we need O(n^2) parameters and O(n^2) time complexity, especially when the key words for interaction are distant, like “hard” and “case” in “the case I bought is really hard”. Finally, it is worth noting that the problems of the n-gram BoW model (n >= 2) are very similar to those of SVM-Poly, by shifting the encoding of word pairs from x_i x_j to x_{ij}. For consistency, we will use SVM-Poly as a general case for the following discussion.


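To make the parameter-independence issue concrete, here is a small sketch (the toy vocabulary and random weights are illustrative assumptions, not from the paper) of how a 2-degree polynomial projection assigns every word pair its own isolated weight:

```python
import numpy as np

# Toy vocabulary; a real BoW model would have thousands of words.
vocab = ["hard", "push", "case", "good", "button"]
n = len(vocab)

# A 2-degree polynomial projection introduces one feature per word pair,
# so the number of interaction parameters grows as O(n^2).
num_pairwise_params = n * (n - 1) // 2  # 10 pairs for n = 5

# Each pair gets its own independent weight: the weights for
# ("hard", "push") and ("hard", "case") share nothing, even though
# both pairs involve the word "hard".
rng = np.random.default_rng(0)
w_pair = {(i, j): rng.normal() for i in range(n) for j in range(i + 1, n)}
```

For a vocabulary of 50,000 words this would already be over a billion pairwise weights, which is the sparsity problem discussed above.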
2.2 Word Embedding (WE) based Methods

Recently, word embeddings have become widely used in many machine learning models. Generally speaking, the embeddings for words are mainly learned by maximizing the likelihood of correctly predicting contextual information, e.g., the skip-gram model Mikolov et al. (2013a). Consequently, words that are semantically similar have similar representations, e.g., “cost” and “price”. However, such word embeddings do not directly carry sentiment information: e.g., “good” and “bad” are also neighbors in word vector space (as observed from the default result generated by word2vec), but they actually hold opposite sentiment polarities.

x_i : a word feature, e.g., word occurrence
x : a sequence of word features
\phi : feature mapping/projection
\hat{y} : a prediction, e.g., sentiment polarity
\sigma : sigmoid function
v : a word vector/factor
w : a model parameter for learning
w_0 : bias term
f : a non-linear activation function
k : dimension of a factor/vector
t : distance (context size)
l : number of convolutional layers
c : number of channels
h : word window size (length of a CNN filter)
m : average number of words in a document
m_d : number of words in document d
n : number of words in vocabulary
p_{ij} : distance between two words i and j
g(i,k) : gradient of v(i,k)
\lambda : regularization term
\eta : learning rate
Table 1: Definition of Notations

A natural way to learn sentiment information with WE is to follow the manner of BoW, i.e., to train a classifier like LR/SVM. Certainly, deep learning models like the convolutional neural network (CNN) can be more suitable for employing word embeddings as input with their particular architecture designs. Recently, some advanced models jointly encode semantic and sentiment information in word vector space Kim (2014); Tang et al. (2014). However, they still do not explicitly reflect SO word interaction.

The reason is that while most of the neural network models are based on full sentence/document modeling, they are coarse-grained in nature and not good at capturing fine-grained information at the word/snippet level He and Lin (2016). Specifically, let us first investigate the convolution operation in Equation 5. Note that a word is now denoted by a vector v_i. The learning parameter W is applied to a window of h words to generate a convolutional feature o_i:

o_i = f(\mathbf{W} \cdot \mathbf{v}_{i:i+h-1} + b) \quad (5)

A feature map o = [o_1, \ldots, o_{m_d - h + 1}] is then obtained by processing a sequence of words. After that, a max pooling operation is applied to take the maximum value \hat{o} = \max(o) Collobert et al. (2011) for each such feature map.

Although W here can capture the interaction between different words, it is not used in a word-wise manner, i.e., the SO word interaction is modeled implicitly, as it is hard to measure the direct interaction between two particular words, say “hard” and “case”. In contrast, such interaction can be explicitly encoded in w_{hard,case} in SVM-Poly. Also, due to the non-linear function and the pooling operation in CNN, the specific word interaction between two words becomes even harder to track.
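As a minimal sketch of the convolution and pooling steps just described (the dimensions, the filter and the tanh activation are illustrative choices, not the paper's settings), note that the contribution of any specific word pair is entangled inside the shared filter and cannot be isolated afterwards:

```python
import numpy as np

k, h, seq_len = 10, 3, 7               # vector dim, window size, sentence length
rng = np.random.default_rng(1)
words = rng.normal(size=(seq_len, k))  # one row per word vector
W = rng.normal(size=(h * k,))          # a single convolution filter
b = 0.0

# Slide the filter over every window of h consecutive words and apply a
# non-linear activation to obtain the feature map.
feature_map = np.array([
    np.tanh(W @ words[i:i + h].reshape(-1) + b)
    for i in range(seq_len - h + 1)
])

# Max pooling keeps only the strongest response; the word-pair interactions
# inside each window are mixed into W and cannot be read back out.
pooled = float(feature_map.max())
```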


3 Proposed Factorization Machines

As discussed above, some related models coarsely consider SO word interaction and they have limitations. For example, CNN considers the contextual relationship between words and encodes all interactions in a general parameter W, but the specific word-wise interaction is implicitly modeled and hard to track. SVM-Poly encodes SO word interaction in parameters like w_{hard,case}, but it requires O(n^2) such parameters and those parameters are all independent.

It would be a promising direction if we could adopt their advantages while overcoming their shortcomings in a joint modeling process. Motivated by this, we exploit the Factorization Machine (FM). However, notice that FM was originally designed for recommender systems and is not directly applicable to fine-grained sentiment analysis. Therefore, we propose two new models, CFM and PFM.

In this section, we first introduce the basis of FM. We then illustrate how to exploit it and point out its problems in fine-grained sentiment analysis. After that, we present our new models, the optimization approach, and the analysis of complexity.

3.1 Factorization Machine Basis

Factorization Machine (FM) Rendle (2010) was proposed as a generic framework to learn the dependency of interaction variables by factorizing them into latent factors. A factor can be generally understood as a vector, so in this paper we will use the terms factor and vector interchangeably. The model equation of the 2-degree factorization machine is presented in Equation 6. Here v_i is called a factor/vector for element i (1 <= i <= n) and ⟨v_i, v_j⟩ denotes the dot product of v_i and v_j. Similarly, its linear and bias terms are not included here but will be used in experiments.

\hat{y}(\mathbf{x}) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j \quad (6)

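The pairwise sum in Equation 6 can be sketched as follows; the sizes are toy assumptions, and the O(nk) reformulation in the last line is the standard FM identity from Rendle (2010):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 4                                   # vocabulary size, factor dimension
V = rng.normal(size=(n, k))                   # one factor v_i per word
x = rng.integers(0, 2, size=n).astype(float)  # word-occurrence features

# Naive O(n^2 k) double sum over all word pairs, as in Equation 6.
naive = sum(float(V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Equivalent O(nk) form:
# 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
fast = 0.5 * float(np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2)))
```

Both expressions agree to floating-point precision, which is why FM avoids the quadratic cost of SVM-Poly.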
3.2 Exploiting FM for Sentiment Analysis

We exploit FM for sentiment analysis in the following manner: while x_i is used as a word feature (as in LR/SVM) for the word at position i, the factor v_i can be viewed as its word vector. However, different from the traditional word embedding, the word vector here carries word-wise interaction information and is sentiment-sensitive. We refer to it as a sentiment-sensitive Word Interaction (SWI) vector. In this setting, the SO interaction between two words is determined by the dot product of their SWI vectors; for example, the interaction between “hard” and “case” is denoted by ⟨v_hard, v_case⟩.

Let us compare FM (Equation 6) and SVM-Poly (Equation 4) for a better understanding. One can see that instead of using an independent interaction parameter w_{ij}, here the SO interaction effect of two words is jointly determined by their two SWI vectors. Recall that w_{hard,push} and w_{hard,case} are two isolated parameters in SVM-Poly, but in our case, ⟨v_hard, v_push⟩ and ⟨v_hard, v_case⟩ can reflect that they share the same word “hard”, because they both contain the SWI vector v_hard. Meanwhile, note that their resulting sentiment polarities are different: interacting with “push” is prone to the negative class (e.g., ⟨v_hard, v_push⟩ < 0, where negative values indicate the negative sentiment class), while interacting with “case” is close to the positive class (e.g., ⟨v_hard, v_case⟩ > 0). The SWI vectors such as v_hard, v_push and v_case are jointly learned under the supervision of sentiment labels.
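To illustrate the sharing, here are hand-picked (not learned) toy SWI vectors in which both interactions literally reuse the same factor for “hard”, yet their dot products carry opposite sentiment signs:

```python
import numpy as np

# Hand-picked 2-dimensional toy SWI vectors, for illustration only.
v = {
    "hard": np.array([1.0, -1.0]),
    "push": np.array([-1.0, 0.5]),   # <v_hard, v_push> = -1.5 (negative cue)
    "case": np.array([1.0, -0.5]),   # <v_hard, v_case> =  1.5 (positive cue)
}

# Both interactions contain the same shared factor v["hard"], unlike the
# isolated pairwise weights of SVM-Poly.
neg = float(v["hard"] @ v["push"])
pos = float(v["hard"] @ v["case"])
```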

However, this direct application of FM is not suitable for fine-grained analysis at the snippet level for two main reasons: (1) it lacks the modeling of contextual information; (2) it does not consider the position/ordering of words. Yet these are two important signals for connecting aspect and opinion information in sentiment reasoning at the snippet level. To address them, two new models are proposed and introduced below.

3.3 Contextual Factorization Machine

We first propose the Contextual Factorization Machine (CFM). Different from other existing FMs, CFM models contextual information in text. Note that in fine-grained analysis, we aim at detecting sentiment for an aspect-specific opinion expression, for example, a snippet “the screen is very clear” from a full review “I made this purchase two days ago … the screen is very clear … ”. We observe that to determine the sentiment of that snippet, there is no need to capture fully pairwise word interactions. Concretely, the sentiment interaction between the word “screen” and the first/last few words in the original full review could be less informative. Those words may not even be related to “screen” but to another aspect (e.g., “purchase”). As a result, their word interactions can be harmful if they are involved in learning. So an intuitive solution is to focus on the interactions constructed by the nearest contextual words. In other words, by capturing the most significant word dependencies, CFM can learn fine-grained SO interaction more accurately. Its model equation is shown in Equation 7. The idea is to impose a constraint so that word interactions are learned only within a certain distance t, which is inspired by the neural skip-gram model. However, here it is designed for better alignment of aspect and opinion information rather than for word prediction.

\hat{y}(\mathbf{x}) = \sum_{i=1}^{n} \sum_{j=i+1}^{\min(i+t,\, n)} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j \quad (7)

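CFM's context constraint can be sketched as follows (names and sizes are illustrative assumptions): only pairs within distance t contribute to the score, instead of all pairs as in FM.

```python
import numpy as np

rng = np.random.default_rng(3)
n_words, k, t = 8, 4, 2            # snippet length, factor dim, context size
V = rng.normal(size=(n_words, k))  # SWI vector of the word at each position

# Enumerate only the word pairs with j - i <= t (Equation 7's constraint).
pairs = [(i, j) for i in range(n_words)
         for j in range(i + 1, min(i + t + 1, n_words))]
score = sum(float(V[i] @ V[j]) for i, j in pairs)
```

For n_words = 8 and t = 2 this keeps 13 pairs rather than the full 28, and the saving grows with document length.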
3.4 Position-aware Factorization Machine

One shortcoming of CFM is that it treats the same words at different positions as identical in SO interaction, which is not always true. In fact, word positions may help distinguish different sentiment polarities in some cases. To incorporate this indicative signal, we create a more comprehensive model named Position-aware Factorization Machine (PFM), where the SO word interaction, context and position information are jointly learned by the SWI vectors. Equation 8 shows its model equation. Compared to CFM, p_{ij} is newly designed to denote the distance between two words i and j. Take the snippet “the case is very hard” again as an example: we will have p_{case,hard} = 3, i.e., the distance between “case” and “hard” is 3, and their SO word interaction now depends on the SWI vectors indexed by that distance, ⟨v_case^{(3)}, v_hard^{(3)}⟩.

\hat{y}(\mathbf{x}) = \sum_{i=1}^{n} \sum_{j=i+1}^{\min(i+t,\, n)} \langle \mathbf{v}_i^{(p_{ij})}, \mathbf{v}_j^{(p_{ij})} \rangle\, x_i x_j \quad (8)

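Under one possible reading of PFM that matches the n·k·t parameter count in Table 2 (a separate SWI vector per word per distance; this parameterization is an assumption for illustration, not necessarily the paper's exact formulation), the position-aware interaction could be sketched as:

```python
import numpy as np

rng = np.random.default_rng(4)
n_vocab, k, t = 5, 4, 3
# Assumed parameterization: V[i, p] is word i's SWI vector at distance p,
# giving roughly n * t * k parameters as in Table 2.
V = rng.normal(size=(n_vocab, t + 1, k))

def interaction(i, j, p):
    """SO interaction of words i and j at distance p (1 <= p <= t)."""
    return float(V[i, p] @ V[j, p])

# "the case is very hard": the distance between "case" and "hard" is 3
# (here word ids 1 and 4 stand in for "case" and "hard").
score = interaction(1, 4, 3)
```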
3.5 Optimization

In terms of learning, we formulate our task as an optimization problem. Since the sentiment information needs to be learned under supervision from document labels (positive/negative), we optimize a logistic loss. In addition, we impose L2 regularization parameterized by \lambda. Following Jahrer et al. (2012), mini-batch stochastic gradient descent (SGD) is used. We also implement an adaptive learning-rate schedule Zeiler (2012) to speed up the learning process; particularly, AdaGrad Duchi et al. (2011) is adopted. We show the gradient of the factor v(i,k) in CFM below; the gradient for PFM can be derived similarly. \lambda is the regularization term and \eta is the learning rate; the L2 term \lambda\, v(i,k) is added to this gradient during the update.

\frac{\partial \hat{y}}{\partial v(i,k)} = x_i \sum_{j \neq i,\ |i-j| \le t} v(j,k)\, x_j


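The update described above can be sketched as a per-coordinate AdaGrad step with L2 regularization (variable names and the toy values are illustrative assumptions):

```python
import numpy as np

eta, lam, eps = 0.01, 1.0, 1e-8   # learning rate, L2 weight, stability constant

def adagrad_step(v, cache, grad_loss):
    """One AdaGrad update of an SWI factor v given the loss gradient."""
    g = grad_loss + lam * v                   # add the L2 regularization term
    cache = cache + g * g                     # accumulate squared gradients
    v = v - eta * g / (np.sqrt(cache) + eps)  # per-coordinate adaptive rate
    return v, cache

v = np.zeros(10)            # one SWI vector, k = 10 as in our setting
cache = np.zeros_like(v)
v, cache = adagrad_step(v, cache, grad_loss=np.ones(10))
```

Coordinates that receive large or frequent gradients automatically get smaller effective learning rates, which is why AdaGrad suits the sparse word features here.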
3.6 Complexity and Analysis

We report the number of parameters and the complexity for learning in Table 2. m is the average length of one document, t is the distance indicator (context size) and k is the vector dimension. CNN-S means the CNN model using static word embeddings as input and CNN-J means the CNN model jointly learning word embeddings. The meaning of other symbols can be found in Table 1. We have the following observations: (1) both CFM and PFM are linear in n for the growth of variables and complexity. (2) CFM and PFM are faster than FM because FM calculates all pairwise interactions (the proof of the O(nk) complexity for FM is reported in Rendle (2010)) while they do not need to. (3) CFM and PFM are both less complex than SVM-Poly: while O(n^2) parameters are required to learn all pairwise word interactions in SVM-Poly, only O(nk) are needed for the FMs. (4) Compared to CNNs, CFM and PFM simplify the learning process: while word embeddings are used in the input layer and a CNN learns the sentiment information through further deep layers, the SWI vectors used in CFM and PFM jointly encode all related information.

Model   Parameters
FM      nk
CFM     nk
PFM     nkt
Table 2: Comparison of the number of parameters and computing complexity with related models

4 Experiments

Our evaluation is two-step. First, we conducted sentiment classification at the document level using full online reviews. Second, we used the models built from full reviews to classify review snippets (phrases or sentences). Specifically, a set of review snippets with human labels (positive/negative) was used as our prediction targets, while we still utilized the same set of full documents for training. The intuition is that, as discussed in Section 1, the SO word interaction may have limited impact at the document level, but it plays a crucial role in fine-grained analysis at the snippet level, because a short text usually contains limited information or fewer strongly salient opinion words (e.g., “excellent”). In this case, when all candidate models are trained on the same set of full documents, a model that better captures explicit SO word interaction should be able to identify the sentiment of a short snippet more accurately.

4.1 Datasets

We use three real-world review datasets. The label for a full review can be directly obtained because a rating score is often provided, but the label for a text snippet requires human labeling. We thus download the human-labeled snippets from the UCI dataset as our test sets in the second step. They are word phrases or short sentences (snippets from full reviews) about movies from IMDB, cellphones from Amazon and restaurants from Yelp. Each set contains 1000 snippets (500 positive and 500 negative). For full reviews, we use the movie review dataset from Pang and Lee (2004), which contains 1000 positive and 1000 negative movie reviews, cellphone reviews from McAuley et al. (2015), and restaurant reviews from Yelp. For consistency, 1000 positive and 1000 negative reviews are sampled for cellphone and restaurant. Reviews with rating scores 5 and 4 are treated as positive and scores 2 and 1 as negative, following Chen et al. (2015); Johnson and Zhang (2014). For the first task, document-level sentiment classification, we conduct 5-fold cross validation using only full reviews, splitting each data set into 70%, 10% and 20% for training, validation and testing. For the second task, we use the classifiers trained on all full reviews to classify review snippets. Notice that we keep the same parameter settings for the classifiers in both tasks. The average document/snippet length of each data set is reported in Table 3.

                        Training (docs)   Testing (snippets)
Dataset      Source     Average Length    Average Length
Cellphone    Amazon     90                10
Restaurant   Yelp       70                11
Movie        IMDB       668               15
Table 3: Data Information

4.2 Candidate Models for Comparison

FM: The classic factorization machine.
CFM: Contextual Factorization Machine.
PFM: Position-aware Factorization Machine.
SVM-BoW: Linear SVM with Bag-of-Words.
SVM-Poly: SVM with the Poly kernel.
LR-BoW: Logistic regression with Bag-of-Words, similar to SVM-BoW in settings.
SVM-WE: SVM with word embeddings. The averaged feature values of word embeddings are used as review representations. This setup evaluates the embedding contributions Ding et al. (2015); Kotzias et al. (2015). We use word2vec to train vectors for every dataset.
LR-WE: Logistic regression with word embeddings, similar to SVM-WE in settings.
CNN-S: Convolutional Neural Network, a representative neural model for sentiment classification using word embeddings Kalchbrenner et al. (2014); Kim (2014). The word embeddings are pre-trained and used as static input.
CNN-J: another CNN whose word embeddings will be jointly learned with other parameters in the deep layers Kim (2014); Tang et al. (2014). We put pre-trained word embeddings for initialization.
CNN-S(+): Similar to CNN-S, but we increase the length of the word embeddings to 300 (i.e., longer vectors), while the length of the word embeddings in the above CNNs is kept the same as in all FMs.
CNN-J(+): Similar to CNN-J, but we increase the word embeddings length following CNN-S(+).

4.3 Parameter Settings

For CFM and PFM, we set the context size t to 5, following related vector-learning approaches Mikolov et al. (2013b); Mnih and Kavukcuoglu (2013), and the dimension of the word vector k to 10. We did pilot experiments and found that a bigger vector length does not have a significant influence, which indicates that a small vector length is enough for learning SWI vectors. We maintain the same setting for training the skip-gram vectors used in SVM-WE, LR-WE, CNN-S and CNN-J. We learn vectors of length 300 Mikolov et al. (2013b) for CNN-S(+) and CNN-J(+). The learning rate \eta is empirically set to 0.01 for CFM and PFM. The regularization term \lambda is set to 1 for CFM for all data sets, which works consistently well. Bias and L2 regularization terms are also used in SVM-BoW, LR-BoW, SVM-WE and LR-WE for consistency. We follow the parameter settings from Kim (2014) for the CNNs.

             Document              Snippet
Models       Accuracy   F1-Score   Accuracy   F1-Score
LR-BoW       0.863      0.862      0.711      0.709
SVM-BoW      0.855      0.855      0.709      0.706
SVM-Poly     0.832      0.832      0.500      0.333
LR-WE        0.661      0.661      0.654      0.649
SVM-WE       0.667      0.666      0.658      0.652
CNN-S        0.683      0.676      0.601      0.555
CNN-J        0.747      0.745      0.609      0.570
CNN-S(+)     0.851      0.852      0.729      0.722
CNN-J(+)     0.851      0.851      0.738      0.732
FM           0.765      0.764      0.607      0.540
CFM          0.826      0.822      0.785      0.784
PFM          0.850      0.850      0.789      0.788
Table 4: Document and snippet classification results for movie. The numbers in bold highlight the best results in FMs and other baselines.

             Document              Snippet
Models       Accuracy   F1-Score   Accuracy   F1-Score
LR-BoW       0.838      0.837      0.823      0.823
SVM-BoW      0.837      0.836      0.824      0.824
SVM-Poly     0.823      0.823      0.740      0.733
LR-WE        0.767      0.767      0.737      0.733
SVM-WE       0.767      0.766      0.736      0.732
CNN-S        0.754      0.755      0.731      0.731
CNN-J        0.823      0.823      0.799      0.799
CNN-S(+)     0.859      0.859      0.815      0.815
CNN-J(+)     0.849      0.849      0.810      0.810
FM           0.786      0.786      0.745      0.731
CFM          0.835      0.833      0.822      0.821
PFM          0.842      0.841      0.833      0.833
Table 5: Document and snippet classification results for cellphone. The numbers in bold highlight the best results in FMs and other baselines.

             Document              Snippet
Models       Accuracy   F1-Score   Accuracy   F1-Score
LR-BoW       0.809      0.808      0.812      0.812
SVM-BoW      0.808      0.807      0.811      0.810
SVM-Poly     0.778      0.777      0.594      0.520
LR-WE        0.743      0.742      0.690      0.689
SVM-WE       0.742      0.741      0.683      0.681
CNN-S        0.787      0.787      0.739      0.739
CNN-J        0.823      0.822      0.785      0.785
CNN-S(+)     0.827      0.827      0.805      0.804
CNN-J(+)     0.846      0.846      0.818      0.818
FM           0.790      0.788      0.698      0.674
CFM          0.839      0.839      0.838      0.837
PFM          0.839      0.839      0.842      0.842
Table 6: Document and snippet classification results for restaurant. The numbers in bold highlight the best results in FMs and other baselines.

4.4 Experimental Results and Analysis

The experimental results are given in Tables 4, 5 and 6. In every table, the left-hand side shows the accuracy and F1-score for document-level sentiment classification and the right-hand side shows the snippet-level results. Each table consists of two parts: the lower part presents the FM models (FMs) and the upper part presents the baselines. The highest scores are marked in bold for both parts.

First, we have the following observations at the document level (from first two columns in tables):

  1. Most models have competitive performance (except LR-WE, SVM-WE, CNN-S and CNN-J), which implies that when rich information is available in a full review, simply summarizing its overall sentiment orientation can alleviate the problem of lacking SO word interaction.

  2. Our proposed models CFM and PFM are comparable to state-of-the-art baselines. While a small vector length and simplified parameters are used in CFM and PFM for learning, their performance is close to CNN-S/J(+), which is encouraging.

  3. CFM and PFM outperform FM on all data sets, which shows their superiority and rationality in sentiment analysis.

Second, we have the following observations at the snippet level (from last two columns in tables):

  1. Our proposed models CFM and PFM dramatically outperform the other baselines in this fine-grained setting. In addition, they consistently achieve good results on all three data sets.

  2. Compared to CNN-S/J(+), CFM and PFM have better performance, even though these CNNs use bigger vectors (length 300). It is worth noting that these two CNNs actually achieve very good results at the document level (see the left two columns) and their parameters are well fit; however, they do not perform as well at the snippet level as our models do, which indicates the effectiveness of capturing SO word interaction.

  3. FM performs very poorly, and the significant improvement from CFM and PFM demonstrates their effectiveness for fine-grained sentiment analysis.

Third, we have the following further observations:

  1. Considering the performance in both settings, we can see the robustness of PFM and also CFM, which achieve consistently stable results.

  2. We also tried SVM with different kernels, including SVM-Poly, but the linear SVM consistently achieves the best results. It has also been reported by other researchers that the linear kernel performs best for binary text classification Joachims (1998); Colas and Brazdil (2006); Fei and Liu (2015).

5 Related Work

In the machine learning context, Bag-of-Words models were first used to build classifiers for sentiment classification Pang et al. (2002). Later, dense low-dimensional word vectors became a better alternative for word representation Blei et al. (2003). Recently, word embeddings like the skip-gram model Mikolov et al. (2013a) have shown their superiority in many NLP tasks Turian et al. (2010). Maas et al. (2011) first introduced a topic model variant to jointly encode sentiment and semantic information; later, with the development of CNNs Collobert et al. (2011) in text mining, joint CNN models Kim (2014); Tang et al. (2014) achieved better, state-of-the-art results. But they did not conduct fine-grained analysis at the snippet level. Tang et al. (2016); Li et al. (2017) performed aspect-level sentiment classification using aspect labels for training and testing, which is essentially different from our task. None of the above work explicitly captures SO word interaction.

Another related work is from He and Lin (2016) who considered the word interaction problem but aimed at mapping similar word interactions across different sentences. Johnson and Zhang (2014) inspected position information but it is not for modeling word interactions.

The concept of SO word interaction is related to sentiment negation/shifter theory Polanyi and Zaenen (2006), contextual polarity Wilson et al. (2009) and sentiment composition works Choi and Cardie (2008); Moilanen and Pulman (2007); Neviarouskaya et al. (2009). However, they do not target learning a joint model with the information encoded in SWI vectors as we do. Also, we do not use external resources like an NLP parser Socher et al. (2013); Naseem et al. (2010) to help infer sentiment information.

Factorization Machine Rendle (2010) is a general approach that learns feature conjunctions. It is widely used in recommender systems Jahrer et al. (2012); Petroni et al. (2015); Juan et al. (2016), but existing FMs are not suitable for sentiment analysis. We also compared the classical FM in our experiments.

6 Conclusion

This paper introduced a framework that can explicitly capture sentiment-oriented word interaction by learning a set of sentiment-sensitive Word Interaction (SWI) vectors. Specifically, two new models were developed, namely, Contextual Factorization Machine (CFM) and Position-aware Factorization Machine (PFM). They benefit fine-grained analysis at the snippet level and also simplify the parameter learning. Extensive experimental results show their effectiveness.


  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022.
  • Chen et al. (2015) Zhiyuan Chen, Nianzu Ma, and Bing Liu. 2015. Lifelong learning for sentiment classification. In ACL (2), pages 750–756.
  • Choi and Cardie (2008) Yejin Choi and Claire Cardie. 2008. Learning with compositional semantics as structural inference for subsentential sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 793–801. Association for Computational Linguistics.
  • Colas and Brazdil (2006) Fabrice Colas and Pavel Brazdil. 2006. Comparison of SVM and some older classification algorithms in text classification tasks. In IFIP International Conference on Artificial Intelligence in Theory and Practice, pages 169–178. Springer.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
  • Ding et al. (2015) Xiao Ding, Ting Liu, Junwen Duan, and Jian-Yun Nie. 2015. Mining user consumption intention from social media using domain adaptive convolutional neural network. In AAAI, volume 15, pages 2389–2395.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
  • Fei and Liu (2015) Geli Fei and Bing Liu. 2015. Social media text classification under negative covariate shift. In EMNLP, pages 2347–2356.
  • He and Lin (2016) Hua He and Jimmy Lin. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In Proceedings of NAACL-HLT, pages 937–948.
  • Jahrer et al. (2012) Michael Jahrer, Andreas Töscher, Jeong-Yoon Lee, J Deng, Hang Zhang, and Jacob Spoelstra. 2012. Ensemble of collaborative filtering and feature engineered models for click through rate prediction. In KDDCup Workshop.
  • Joachims (1998) Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer.
  • Johnson and Zhang (2014) Rie Johnson and Tong Zhang. 2014. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058.
  • Juan et al. (2016) Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware factorization machines for ctr prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 43–50. ACM.
  • Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Kotzias et al. (2015) Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606. ACM.
  • Li et al. (2017) Cheng Li, Xiaoxiao Guo, and Qiaozhu Mei. 2017. Deep memory networks for attitude identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 671–680. ACM.
  • Liu (2012) Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1):1–167.
  • Maas et al. (2011) Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Mnih and Kavukcuoglu (2013) Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pages 2265–2273.
  • Moilanen and Pulman (2007) Karo Moilanen and Stephen Pulman. 2007. Sentiment composition. In Proceedings of RANLP, volume 7, pages 378–382.
  • Naseem et al. (2010) Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using universal linguistic knowledge to guide grammar induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1234–1244. Association for Computational Linguistics.
  • Neviarouskaya et al. (2009) Alena Neviarouskaya, Helmut Prendinger, and Mitsuru Ishizuka. 2009. Compositionality principle in recognition of fine-grained emotions from text. In ICWSM.
  • Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics.
  • Pang et al. (2002) Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 79–86. Association for Computational Linguistics.
  • Petroni et al. (2015) Fabio Petroni, Luciano del Corro, and Rainer Gemulla. 2015. Core: Context-aware open relation extraction with factorization machines. Assoc. for Computational Linguistics.
  • Polanyi and Zaenen (2006) Livia Polanyi and Annie Zaenen. 2006. Contextual valence shifters. In Computing attitude and affect in text: Theory and applications, pages 1–10. Springer.
  • Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining, pages 995–1000. IEEE.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, Christopher Potts, et al. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, page 1642. Citeseer.
  • Tang et al. (2016) Duyu Tang, Bing Qin, and Ting Liu. 2016. Aspect level sentiment classification with deep memory network. arXiv preprint arXiv:1605.08900.
  • Tang et al. (2014) Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for twitter sentiment classification. In ACL (1), pages 1555–1565.
  • Turian et al. (2010) Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.
  • Wilson et al. (2009) Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2009. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational linguistics, 35(3):399–433.
  • Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.