Review Helpfulness Prediction with Embedding-Gated CNN

08/29/2018 · by Cen Chen, et al.

Product reviews, predominantly in the form of text, significantly help consumers finalize their purchasing decisions. Thus, it is important for e-commerce companies to predict review helpfulness so as to present and recommend reviews in a more informative manner. In this work, we introduce a convolutional neural network model that extracts abstract features from multi-granularity representations. Inspired by the fact that different words contribute to the meaning of a sentence differently, we learn word-level embedding-gates for all the representations. Furthermore, it is common that some product domains/categories have rich user reviews while others do not. To help domains with insufficient data, we integrate our model into a cross-domain relationship learning framework that effectively transfers knowledge from other domains. Extensive experiments show that our model yields better performance than existing methods.




1 Introduction

Product reviews, primarily texts, are an important information source for consumers making purchase decisions. Hence, it makes great economic sense to quantify the quality of reviews and present consumers with more useful reviews in an informative manner. Growing efforts from both academia and industry have been invested in the task of review helpfulness prediction [Martin and Pu2014, Yang et al.2015, Yang et al.2016, Liu et al.2017a].

Pioneering work hypothesizes that helpfulness is an underlying property of the text, and uses handcrafted linguistic features to study it. For example, [Yang et al.2015] and [Martin and Pu2014] examined semantic features like LIWC, INQUIRER, and GALC. Subsequently, aspect- [Yang et al.2016] and argument-based [Liu et al.2017a] features are demonstrated to improve the prediction performance.

Inspired by the remarkable performance of Convolutional Neural Networks (CNNs) on many natural language processing tasks, we employ a CNN for the review helpfulness prediction task. To enhance the performance of a vanilla CNN on this task, besides the word-level representation, we further leverage multi-granularity information, i.e., character- and topic-level representations. Character-level representations are notably beneficial for alleviating the out-of-vocabulary problem [Ballesteros et al.2015, Kim et al.2016, Chen et al.2018], while the aspect distribution provides another semantic view on the words [Yang et al.2016].

One research question is whether all embeddings should be treated equally in the CNN. Intuitively, different words contribute to the helpfulness of a review with different intensity or importance. For example, descriptive or semantic words (such as “great battery life” or “versatile function”) are more informative than general background words like “phone”. Correspondingly, we propose a word-level gating mechanism to weight the embeddings; the gates are applied over all three types of word representations (i.e., character-, word-, and topic-based) for all words. Gating mechanisms have been commonly used in recurrent neural networks to control how much a unit updates its activation or content [Chung et al.2014]. Our word-level gates are learned automatically within the model and help differentiate important and non-important words. The resulting model is referred to as Embedding-Gated CNN (EG-CNN).

A gating mechanism empowers the CNN in two ways. First, extensive experiments show that our proposed EG-CNN model greatly outperforms handcrafted features, ensembles of handcrafted features, and vanilla CNN models. Second, the gating mechanism selectively memorizes the input representations of the words and scores the relevance/importance of these representations, providing insightful word-level interpretations of the prediction results. The larger a gate's weight, the more relevant the corresponding word is to review helpfulness.

It is common that some product domains/categories have rich user reviews while others do not. For example, the “Electronics” domain of the Amazon Review Dataset [McAuley and Leskovec2013] has more than 354k labeled reviews, while “Watches” has fewer than 10k. Exploiting cross-domain relationships to systematically transfer knowledge from related domains with sufficient labeled data benefits the task on domains with limited reviews. It is worth noting that existing studies on this task focus only on a single product category or largely ignore the inner correlations between different domains. In previous work, some features are domain-specific while others can be shared. For example, image quality features are only useful for cameras [Yang et al.2016], while semantic features and argument-based features are applicable to all domains [Yang et al.2015, Liu et al.2017a].

While there are common practices for transferring knowledge between domains, such as using a shared neural network [Mou et al.2016, Yang et al.2017], domain correlations must be established before knowledge can be transferred properly in our task. Otherwise, transferring knowledge from a wrong source domain may backfire. We thus provide a holistic solution to both domain correlation learning and knowledge transfer by incorporating a domain relationship learning module in our framework. Experiments show that our final model can correctly tap into domain correlations and facilitate knowledge transfer between correlated domains to further boost performance.

The rest of the paper is organized as follows. Section 2 formally defines the problem and presents our model. Section 3 showcases the effectiveness of the proposed model in the experiments. Section 4 presents related work, and finally Section 5 concludes our paper.

2 Model

We define review helpfulness prediction as a regression task: given a review, predict its helpfulness score. The ground truth of helpfulness is determined using the “a of b” approach: a of b users think a review is helpful.
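As a concrete illustration, the ground-truth score under the “a of b” approach is simply the fraction of helpful votes. A minimal sketch in Python (the function name is ours):

```python
def helpfulness_score(a_helpful_votes, b_total_votes):
    """Ground-truth helpfulness under the "a of b" approach:
    the fraction of voting users who found the review helpful."""
    if b_total_votes == 0:
        raise ValueError("review has no votes; helpfulness is undefined")
    return a_helpful_votes / b_total_votes

# e.g., 4 of 5 users found the review helpful
print(helpfulness_score(4, 5))  # -> 0.8
```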

Formally, we consider a cross-domain review helpfulness prediction task where we have a set of labeled reviews from a set of source domains and a target domain. We seek to transfer knowledge from other domains with rich data to the target domain. For a review x, our goal is to predict its helpfulness score y, where d is the domain label indicating which domain the data instance is from.

2.1 Word, Character, and Aspect Representations

A review r consists of a sequence of n words, i.e., r = (w_1, ..., w_n). Following the CNN model in [Kim2014], for the words in a review, we first look up the embeddings of all words from an embedding matrix E ∈ R^{|V|×D}, where |V| is the vocabulary size and D is the embedding dimension. This word embedding matrix is then fed into a convolutional neural network to obtain an output representation. This is a typical word-embedding-based model.

In many applications, such as text classification [Bojanowski et al.2016] and machine reading comprehension [Seo et al.2016], it is beneficial to enrich word embeddings with subword information. Inspired by this, we use a character embedding layer to obtain character embeddings that enrich the word representations. Specifically, the characters of the i-th word are embedded into vectors and then fed into another convolutional neural network to obtain a fixed-size vector c_i.
A recent work [Yang et al.2016] shows that extracting the aspect/topic distribution from the raw textual content does help the task of review helpfulness prediction. The reason is that many helpful reviews tend to talk about certain aspects of a product, like ‘brand’, ‘functionality’, or ‘price’. Inspired by this, we enrich our word representations with aspect distributions. We adopt the model in [Yang et al.2016] to learn an aspect-word distribution matrix A, whose number of rows is the aspect size and number of columns is the vocabulary size. A word-aspect representation is obtained by row-wise normalization of A. Then, for each word w_i in the input review, we obtain its aspect representation a_i by looking up the normalized matrix.
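A minimal sketch of this aspect lookup, assuming a toy aspect-word matrix A with one row per aspect and one column per vocabulary word (function names and the toy matrix are ours):

```python
def row_normalize(A):
    """Row-wise normalization of the aspect-word matrix A
    (rows = aspects, columns = vocabulary words)."""
    out = []
    for row in A:
        s = sum(row)
        out.append([v / s for v in row] if s else [0.0] * len(row))
    return out

def aspect_representation(word_id, A_norm):
    """A word's aspect representation: its column in the normalized matrix,
    i.e., the word's weight under each aspect."""
    return [A_norm[k][word_id] for k in range(len(A_norm))]

# Toy example: 2 aspects over a 3-word vocabulary.
A = [[1, 1, 2],
     [0, 2, 2]]
A_norm = row_normalize(A)
print(aspect_representation(2, A_norm))  # word 2's weight under each aspect
```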

Formally, for an input review r, we obtain its representation X as:

X = [x_1; ...; x_n], with x_i = [w_i; c_i; a_i],

where w_i, c_i, and a_i represent the word-level, character-level, and topic-level representations respectively, and [;] is a stacking operator. Note that n is the sentence length limit. Sentences shorter than n words are padded, while sentences longer than n words are truncated.
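The padding/truncation step above can be sketched as follows (the pad token string is an assumption of this sketch):

```python
def pad_or_truncate(tokens, n, pad_token="<pad>"):
    """Fix a review to exactly n tokens: truncate long reviews,
    pad short ones with a special pad token."""
    if len(tokens) >= n:
        return tokens[:n]
    return tokens + [pad_token] * (n - len(tokens))

print(pad_or_truncate(["great", "battery", "life"], 5))
# -> ['great', 'battery', 'life', '<pad>', '<pad>']
```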

2.2 Embedding-gated CNN (EG-CNN)

Some words play more important roles in review helpfulness prediction than others; for example, descriptive or semantic words (such as “great battery life” or “versatile function”) are more informative than general background words like “phone”. Hence, we propose to weight the input word embeddings. Specifically, we propose a gating mechanism to weight each word in our model. The word-level gate is obtained by feeding the input embeddings into a gating layer, which is essentially a fully-connected layer with weight W_g and bias b_g.

Formally, for input x_i, we obtain its gated representation as follows:

x̃_i = σ(W_g x_i + b_g) ⊙ x_i,

where σ is a sigmoid activation function and ⊙ denotes element-wise multiplication.
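For intuition, the gating layer can be sketched in pure Python. This sketch assumes a scalar gate per word (the learned gate may instead be a vector); w_g and b_g stand in for the learned parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_word(x, w_g, b_g):
    """Word-level embedding gate (scalar-gate variant, an assumption of
    this sketch): g = sigmoid(w_g . x + b_g), then scale the embedding
    by g. Returns the gated embedding and the gate value."""
    g = sigmoid(sum(wi * xi for wi, xi in zip(w_g, x)) + b_g)
    return [g * xi for xi in x], g
```

With untrained (zero) parameters the gate is 0.5 for every word; training pushes the gate toward 1 for informative words and toward 0 for background words.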

Next, we stack a 2-D convolutional layer and a 2-D max-pooling layer on the matrix X̃ to obtain the hidden representation. Multiple filters are used here. For each filter of window size s, with embedding dimension D and channel size m, we apply a convolutional layer followed by a max-pooling layer to obtain a hidden representation. All such representations are then concatenated to form the final representation h. We refer to our base model as Embedding-Gated CNN (EG-CNN), where EG-CNN learns a hidden feature representation h for an input review.
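As an illustration, a single filter's convolution-plus-max-pooling step can be sketched in pure Python (a narrow convolution with ReLU activation; the function name and toy inputs are ours):

```python
def conv_maxpool(X, F, bias=0.0):
    """One filter of a narrow convolution over the (n x D) matrix X with
    an (s x D) filter F, ReLU activation, then max-over-time pooling.
    Returns one scalar; with m filters, concatenating the outputs gives
    an m-dimensional hidden representation."""
    n, s = len(X), len(F)
    feats = []
    for t in range(n - s + 1):  # slide the window over word positions
        z = bias + sum(F[i][j] * X[t + i][j]
                       for i in range(s) for j in range(len(F[0])))
        feats.append(max(0.0, z))  # ReLU
    return max(feats)  # max-pooling over all window positions

# Toy example: 3 words with 2-dim embeddings, one filter of window size 2.
print(conv_maxpool([[1, 0], [0, 1], [1, 1]], [[1, 0], [0, 1]]))
```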

2.3 Cross-Domain Relationship Learning

If we treat all the domains as a single domain, we can build a unified model for our task. Specifically, our target is to optimize the following objective:

min Σ_d Σ_{(x, y) ∈ D_d} ℓ(W · f(x), y) + R,

where W is the output layer, f(·) is the feature representation learned by EG-CNN, x is an input from domain d with corresponding label y, D_d is the data of domain d, and R is a regularization term.

The formulation above is limited because it does not take the differences between domains into consideration. To utilize multi-domain knowledge, we convert the method to a multi-domain setting where we assume an output layer W_d for each domain d. While still a unified model learning a universal feature representation, our new approach has two output layers, a shared layer W_s and a domain-specific layer W_d, to model domain commonalities and differences respectively.

Furthermore, we explicitly model a domain correlation matrix Ω, where Ω_{ij} is the correlation between domains i and j. Following the matrix-variate distribution setting of [Zhang and Yeung2010], our objective includes minimizing the trace of the matrix product W Ω⁻¹ Wᵀ. Consequently, when domain i and domain j are close, i.e., W_i is close to W_j, the model tends to learn a large Ω_{ij} in order to minimize the trace. In all, our objective is defined as follows:


min Σ_d Σ_{(x, y) ∈ D_d} ℓ((W_s + W_d) · f(x), y) + λ_1 tr(W Ω⁻¹ Wᵀ) + λ_2 R,

where tr(·) takes the trace of a matrix, R is a regularization term, and λ_1 and λ_2 are weight coefficients.
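To make the trace regularizer concrete, the following pure-Python sketch evaluates tr(W Ω⁻¹ Wᵀ) for a toy two-domain case (function names and toy values are ours):

```python
def trace_reg(W_rows, Omega):
    """tr(W Omega^{-1} W^T) for two domains: W_rows[i] is domain i's
    output vector, Omega the 2x2 domain-correlation matrix (assumed
    invertible). Expands to sum_{i,j} inv(Omega)[i][j] * (W_i . W_j)."""
    (a, b), (c, d) = Omega
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 inverse
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    return sum(inv[i][j] * dot(W_rows[i], W_rows[j])
               for i in range(2) for j in range(2))
```

Two identical domain vectors incur a smaller penalty under a positive correlation (Ω_{12} = 0.5 gives 4/3) than under the identity matrix (which gives 2), matching the intuition that correlated domains are encouraged to share similar output layers.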

Our final model is presented in Figure 1, where we use EG-CNN as the base model and further consider cross-domain correlation and multi-domain training. Note that if we set Ω to an identity matrix (no domain correlation) and drop the shared output layer W_s, the multi-domain setting degenerates to the fully-shared setting of [Mou et al.2016]. The limitation of the fully-shared setting is that it ignores domain relationships. In practice, however, we may think “Electronics” is helpful to the “Home” and “Cellphones” domains, but not as helpful to the “Watches” domain. With our model, we seek to automatically capture such domain relationships and use that information to boost model performance.

Figure 1: Our final model with cross-domain relationship learning.

3 Experiments

Reviews from 5 different categories in the public Amazon Review Dataset [McAuley and Leskovec2013] are used in the experiments. Data statistics are summarized in Table 1.

General category      # of reviews (≥ 5 votes)   # of reviews (all)

Watches               9,737                      68,356
Cellphones (Phones)   18,542                     78,930
Outdoor               72,796                     510,991
Home                  219,310                    991,784
Electronics (Elec.)   354,301                    1,241,778

Table 1: Amazon reviews from 5 different categories.

In our model, we initialize the lookup table with pre-trained word embeddings from GloVe [Pennington et al.2014]. For aspect representations, we adopt the settings from [Yang et al.2016] and set the topic size to 100. For EG-CNN, the activation function is ReLU, the channel size is set to 128, and AdaGrad [Duchi et al.2011] is used for training with an initial learning rate of 0.08.

Following previous work, all experimental results are evaluated using the correlation coefficient between the predicted helpfulness score and the ground-truth score. The ground-truth scores are computed with the “a of b” approach from the dataset, indicating the percentage of consumers who consider a review helpful.
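The evaluation metric can be sketched as the Pearson correlation coefficient (the paper says only “correlation coefficients”, so the specific variant is our assumption):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted and ground-truth scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```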

3.1 Comparison with Linguistic Feature Baselines and CNN Models

Our proposed EG-CNN model is compared with the following baselines:

  • STR/UGR/LIWC/INQ/ASP: five regression baselines that use the handcrafted features “STR”, “UGR”, “LIWC”, and “INQ” [Yang et al.2015], and the aspect-based features “ASP” [Yang et al.2016], respectively;

  • Fusion: an ensemble model with “STR”, “UGR”, “LIWC”, and “INQ” features [Yang et al.2015];

  • Fusion+: Fusion with additional “ASP” features [Yang et al.2016];

  • CNN (word): the vanilla CNN model [Kim2014] with word-level embeddings only;

  • CNN (+char): the vanilla CNN model with added character-based representations [Chen et al.2018];

  • CNN (+char+topic): the vanilla CNN model with both character- and topic-based representations;

  • EG-CNN: our final model with word-level, character-level, and topic-level representations combined via the gating mechanism.

                   Watches  Phones  Outdoor  Home   Elec.
STR                0.276    0.349   0.277    0.222  0.338
UGR                0.425    0.466   0.412    0.309  0.355
LIWC               0.378    0.464   0.382    0.331  0.400
INQ                0.403    0.506   0.419    0.366  0.405
ASP                0.406    0.437   0.385    0.283  0.406
Fusion             0.488    0.539   0.497    0.432  0.484
Fusion+            0.493    0.550   0.501    0.436  0.491
CNN (word)         0.480    0.562   0.501    0.459  0.524
CNN (+char)        0.495    0.566   0.511    0.464  0.521
CNN (+char+topic)  0.497    0.567   0.524    0.476  0.537
EG-CNN             0.515    0.585   0.555    0.541  0.544
Table 2: Comparison with linguistic feature baselines and CNN models.

Table 2 shows several interesting observations that validate the motivation behind this work. First, all CNN-based models consistently outperform non-CNN models, indicating their expressiveness over handcrafted features. Second, CNN (+char) outperforms CNN (word) when data is relatively insufficient (e.g., the “Watches” and “Phones” domains) and loses its edge on domains with abundant data (e.g., “Electronics”). This is because when the data size is smaller, the out-of-vocabulary (OOV) problem is more severe, and character-based representations are more beneficial. Third, CNN (+char+topic) consistently outperforms CNN (+char), showing that adding topic-based representations further helps the task. Last but not least, our proposed EG-CNN outperforms all CNN variants across all domains, which justifies the addition of embedding gates.

3.2 Comparison with Cross-domain Models

To evaluate the effectiveness of our domain relationship learning, we compare our proposed full model against the following two baselines: the target-only model that uses only data from the target domain, and the fully-shared model that uses a fully shared neural network [Mou et al.2016] for all domains. All three models use EG-CNN as the base model.

Watches Phones Outdoor Home Elec.
Target-only 0.515 0.585 0.555 0.541 0.544
Fully-shared 0.522 0.580 0.551 0.518 0.534
Ours 0.535 0.592 0.561 0.544 0.548
Table 3: Comparison with target-only and fully-shared models. Note that both fully-shared and our model use the multi-task setting, where all domains are jointly modeled and learned.

In all experiments, our model consistently achieves better results than both the target-only and fully-shared models, supporting the effectiveness of cross-domain relationship learning. The improvement is greater on domains with less labeled data; e.g., the “Watches” domain has the fewest reviews and shows the largest improvement.

Interestingly, the fully-shared model performs much worse than the target-only model in the “Home” domain. This can be explained by domain shift, under which the fully-shared model may not outperform the target-only model: since some domains are related while others are not, incorporating data from less related domains can hardly help, especially when the target domain (such as “Home”) already has sufficient data for the target-only model to perform well.

4 Related Work

Recent studies on review helpfulness prediction extract handcrafted features from the review texts. For example, [Yang et al.2015] and [Martin and Pu2014] examined semantic features like LIWC, INQUIRER, and GALC. Subsequently, aspect-based [Yang et al.2016] and argument-based [Liu et al.2017a] features were demonstrated to improve prediction performance. These methods require prior knowledge and human effort in feature engineering, and may not be robust for new domains. In this work, we employ CNNs [Kim2014, Zhang et al.2015] for the task, which automatically extract deep features from raw text. As character-level representations are notably beneficial for alleviating the out-of-vocabulary problem [Ballesteros et al.2015, Kim et al.2016], and the aspect distribution provides another semantic view on words [Yang et al.2016], we further enrich the word representation of the CNN with multi-granularity information, i.e., character- and aspect-based representations. As different words contribute differently to the task, we weight word representations with word-level gates. Gating mechanisms have been commonly used in recurrent neural networks to control how much a unit updates its activation or content, and have been demonstrated to be effective [Chung et al.2014, Dhingra et al.2016]. Our word-level gates help differentiate important and non-important words. The resulting model, referred to as embedding-gated CNN, has been shown to significantly outperform existing models.

It is common that some domains have rich user reviews while other domains do not. To help domains with limited data, we study cross-domain learning (transfer learning [Pan and Yang2010] and multi-task learning [Zhang and Yang2017]) for this task. Transfer learning and multi-task learning have been extensively studied in the last decade. With the popularity of deep learning, many neural network (NN) based methods have been proposed for transfer learning [Yosinski et al.2014]. A typical framework uses a shared NN to learn shared features for both source and target domains [Mou et al.2016, Yang et al.2017]. Another approach uses both a shared NN and domain-specific NNs to derive shared and domain-specific features [Liu et al.2017b]. A multi-task relationship learning method is introduced in [Zhang and Yeung2010], which is able to uncover the relationships between domains. Inspired by this, we adopt the relationship learning module in our EG-CNN framework to model the correlation between different domains.

To the best of our knowledge, our work is the first to propose a gating mechanism in CNNs and to study cross-domain relationship learning for review helpfulness prediction.

5 Conclusion

In this work, we tackled review helpfulness prediction with two new techniques: embedding-gated CNN and cross-domain relationship learning. We built our base model on a CNN with word-, character-, and topic-based representations. On top of this model, domain relationships were learned to better transfer knowledge across domains. Experiments showed that our model significantly outperforms the state of the art.