Visually Explainable Recommendation

01/31/2018 ∙ by Xu Chen, et al. ∙ Duke University Tsinghua University Georgia Institute of Technology Rutgers University NetEase, Inc 0

Images account for a significant part of user decisions in many application scenarios, such as product images in e-commerce, or user image posts in social networks. It is intuitive that user preferences on the visual patterns of images (e.g., hue, texture, color, etc.) can be highly personalized, and this provides us with highly discriminative features to make personalized recommendations. Previous work that takes advantage of images for recommendation usually transforms the images into latent representation vectors, which are adopted by a recommendation component to assist personalized user/item profiling and recommendation. However, such vectors are hardly useful in terms of providing visual explanations to users about why a particular item is recommended, which weakens the explainability of recommendation systems. As a step towards explainable recommendation models, we propose visually explainable recommendation based on attentive neural networks to model the user attention on images, under the supervision of both implicit feedback and textual reviews. In this way, we can not only provide recommendation results to the users, but also tell the users why an item is recommended by providing intuitive visual highlights in a personalized manner. Experimental results show that our models not only improve the recommendation performance, but also provide persuasive visual explanations that encourage users to accept the recommendations.




1. Introduction

Recommender systems have been major building blocks for many online applications, providing users with personalized suggestions. To capture users’ personalized preferences as comprehensively as possible, recommender systems have integrated a wide range of information sources in the modeling process.

Images, which are widespread in e-commerce, social networks, and many other applications, are one of the most important resources integrated into modern recommender systems. Previous work that leverages images for personalized recommendation usually transforms images into embedding vectors, which are then incorporated with collaborative filtering (CF) to improve the performance. For example, McAuley et al (McAuley et al., 2015) adopted neural networks to transform images into feature vectors, and used the vectors for product style analysis and recommendation; He et al (He and McAuley, 2016b) further extended the approach to pair-wise learning to rank for recommendation; Geng et al (Geng et al., 2015) adopted image features for recommendation in a social network setting; and Wang et al (Wang et al., 2017) extracted image features with neural networks for point-of-interest recommendation.

Though the recommendation performance has been improved by incorporating image representation vectors extracted with (convolutional) neural networks, the related work has largely ignored an important advantage of leveraging images for recommendation: their ability to provide intuitive visual explanations. This is because, once the whole image is transformed into a fixed latent vector, the image is hardly understandable for users, which makes it difficult for the model to generate visual explanations to accompany its recommendations.

Researchers pointed out long ago that providing appropriate explanations benefits recommender systems in terms of persuasiveness, satisfaction, effectiveness, scrutability, etc. (Herlocker et al., 2000; Tintarev and Masthoff, 2007). Existing explainable recommendation models usually interpret the recommendations based on user reviews (McAuley and Leskovec, 2013; Zhang et al., 2014; Seo et al., 2017). However, as “a picture may paint a thousand words”, textual features may be less intuitive than visual ones. As illustrated in Figure 1, the magnified region of the pants image can intuitively tell user A “the color, position, style, … of the waistband”, while describing them in text may take many words and still fail to provide an intuitive understanding.

Motivated by the desire to fill the gap, in this paper, we propose to provide personalized visual explanations with image highlights (see Figure 1) in the context of image-based recommendation. The key building block of our model is the integration of attention mechanism and collaborative filtering. Specifically, we first design a basic method to model user attention on the product images, and use the learned attention weights to provide visual explanations. Then, to capture more comprehensive preferences, we further extend the basic model by introducing user reviews as an additional weak supervision signal. Compared with existing methods, our approaches not only improve the recommendation performance, but also generate visual explanations for the recommended items.

Contributions. In summary, the main contributions of this work include:

  • We propose visually explainable recommendation to explain recommendations from the visual perspective, which, to the best of our knowledge, is the first time in the research of personalized recommendation.

  • We design two types of neural attentive models to discover user visual preference, with the supervision of implicit feedback as well as textual reviews. With the learned attention weights, we can readily generate visually explainable recommendations.

  • We conduct experiments to verify the effectiveness of our proposed models for Top-N recommendation as well as review prediction. Furthermore, we release a collectively labeled dataset to quantitatively evaluate our generated visual explanations, and we present example analysis to highlight the intuitions of the visual explanations in a qualitative manner.

In the following part of the paper, we first introduce the related work in section 2, and then provide the problem formalization in section 3. Section 4 illustrates the details of our proposed model, and in section 5, we verify the effectiveness of our approach with experimental results. Conclusions and outlooks of this work are presented in section 6.

2. Related Work

There exist two main research lines related to our work – explainable recommendation and the integration of visual features into recommendation systems. We present brief reviews for these two research lines in the following.

2.1. Explainable Recommendation

Many models have been proposed in recent years to provide explainable recommendations (Zhang, 2017). Specifically, McAuley et al (McAuley and Leskovec, 2013) aligned user (or item) latent factors in matrix factorization (MF) with topical distributions in latent Dirichlet allocation (LDA) (Blei et al., 2003) for joint parameter optimization under the supervision of both score ratings and textual reviews, so that user preferences are explained by the learned topical distributions. To explain finer-grained user preferences, Zhang et al (Zhang et al., 2014) translated user reviews into feature-opinion pairs, and then leveraged multi-matrix factorization to discover user preferences as well as item qualities at the feature level. Wu et al (Wu et al., 2016) designed an additive co-clustering model based on Gaussian and Poisson distributions to explain recommendations by jointly optimizing user reviews and ratings, while Heckel et al (Heckel et al., 2017) generated interpretable recommendations by identifying overlapping co-clusters of clients and products based on implicit feedback. By modeling aspects in user reviews, He et al (He et al., 2015) devised a graph algorithm called TriRank for providing recommendations with better explainability and transparency. To leverage user opinions as well as social information, Ren et al (Ren et al., 2017) introduced the concept of viewpoints (a tuple of concept, topic, and sentiment label), and proposed a probabilistic graphical model based on viewpoints to provide explainable recommendations. Seo et al (Seo et al., 2017) utilized an attention mechanism to find both local and global preference information in the textual reviews to explain the users’ rating behaviors.

Different from the above methods that are mainly based on user reviews, we explore to capture users’ visual preferences, and provide explainable recommendations from a new visual perspective.

Figure 1. Personalized visual explanations for the recommended items. Here, different parts of the images are magnified in circle to provide intuitive explanations for the corresponding user. Specifically, the pants and T-shirt are both recommended to user A and B. For user A, the waistband of the pants and the cuff of the T-shirt are highlighted to tell the user why these items are recommended, while for user B, the pants turnup and the T-shirt collar are magnified as personalized explanations.

2.2. Image-based Recommendation

Recently, there has been a trend to incorporate visual features into the research of personalized recommendation. Specifically, McAuley et al (McAuley et al., 2015) introduced the concept of visual recommendation into e-commerce, and released a large dataset for this task. He et al (He and McAuley, 2016b) represented each product image as a fixed-length vector, and infused it into the Bayesian personalized ranking (BPR) framework (Rendle et al., 2009) to improve the performance of Top-N recommendation. To make use of both visual and textual features, Cui et al (Cui et al., 2016) integrated the product images and item descriptions to make dynamic Top-N recommendation. Liu et al (Liu et al., 2017) adopted neural modeling based on product images to model the style of items, which led to improved recommendation performance. Wang et al (Wang et al., 2017) introduced image features into point-of-interest (POI) recommendation, and proposed a graphical framework to model visual content in the context of POI recommendation. Shankar et al (Shankar et al., 2017) designed a unified end-to-end neural model (named VisNet) to build a large-scale visual recommendation system for e-commerce. Chen et al (Chen et al., 2017) introduced the attention mechanism into CF to model both item- and component-level implicit feedback for multimedia recommendation.

Generally, these methods mainly focus on how to leverage image features to enhance the recommendation performance. In contrast, we want to discover users’ personalized visual preferences and, more importantly, to provide visually explainable recommendations.

3. Problem Definition

Suppose there is a set of users and a set of items in our system. Each item has an image, and we extract the item’s visual features from this image by leveraging existing methods such as deep convolutional neural networks (Simonyan and Zisserman, 2015). The visual features of an item consist of one fixed-dimensional vector per spatial region of its image, with one vector for each of the different regions. The textual review of a user on an item is a sequence of words, indexed by position up to the length of the review.

Given the visual features, the user reviews and the user-item historical interaction records, our task is to learn a recommendation model to (1) generate the top-N personalized items as recommendations for a target user, and (2) provide explanations for these recommended items based on the visual features. For easy reference, we list the notations used throughout the paper in Table 1.

4. Visually Explainable Recommendation

In this section, we first propose a base model to attentively incorporate image features into collaborative filtering (CF) to provide visually explainable recommendations. And then, an advanced model is designed to leverage user textual reviews to further enhance the recommendation performance as well as interpretability.

(a) Visually explainable collaborative filtering (VECF)
(b) Review-enhanced visually explainable collaborative filtering (Re-VECF)
Figure 2. (a) Our base model. The regional features are attentively merged into an image representation, which is then multiplied with the user embedding to predict the final result. The learned attention weights are used to generate visual explanations. (b) The review enhanced model. User reviews are incorporated into the architecture to provide more informative signals to enhance the recommendation performance as well as the interpretability.

4.1. The Base Model

Intuitively, when browsing a product image, users often pay more attention to the regions related to their interests, while the attention cast on other parts may be relatively less. For example, a user who wants to buy a round neck T-shirt may care more about the regions around the collar than about other areas. To model this behavior, we design Visually Explainable Collaborative Filtering (VECF for short) based on the attention mechanism to discover users’ region-level preferences, and use the learned attention weights to explain recommendations from the visual perspective.

In the following, we first briefly describe the method for image feature extraction, and then illustrate the model details.

4.1.1. Image feature extraction

For efficiency in practical applications, we pre-extract the regional visual features of images, similar to much previous work (Chen et al., 2017; He and McAuley, 2016b; Liu et al., 2017). Specifically, we feed each product image into the pre-trained VGG-19 (Simonyan and Zisserman, 2015) architecture, and use the output feature map of one of its layers as the final representation. This feature map can be seen as 196 feature vectors of 512 dimensions each, corresponding to 196 different square regions of the image.

This type of pre-processing is essentially equivalent to training an end-to-end model in which the pre-trained VGG parameters are kept fixed. If more computational resources are available, we can also unfreeze the VGG component to obtain a fully end-to-end model for visual feature learning.
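As a small illustration of how the 196 regional vectors can be related back to image locations, the sketch below assumes that the feature map is a 14 × 14 grid (196 = 14 × 14) flattened in row-major order, and that the input image is 224 × 224 pixels, the usual VGG input size. Both values, and the helper names `region_to_box` and `flatten_feature_map`, are assumptions made for illustration; the mapping is only a rough correspondence between a grid cell and a pixel box.

```python
GRID = 14          # assumed: 196 regions arranged as a 14 x 14 grid
IMAGE_SIZE = 224   # assumed: standard VGG input resolution in pixels


def region_to_box(k, grid=GRID, image_size=IMAGE_SIZE):
    """Map a flattened (row-major) region index k to a pixel box
    (left, top, right, bottom) in the input image."""
    if not 0 <= k < grid * grid:
        raise ValueError("region index out of range")
    cell = image_size // grid      # side length of one square region
    row, col = divmod(k, grid)
    return (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)


def flatten_feature_map(fmap):
    """Turn a grid x grid x D nested list into a list of grid*grid D-vectors."""
    return [vec for row in fmap for vec in row]
```

A region that receives a high attention weight can then be highlighted by drawing the box returned by `region_to_box` on the product image.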

Notations Descriptions
The set of users.
The set of items.
The user and item embeddings, and their corresponding dimension.
The regional feature set for an item, the dimension of each regional feature, the merged regional feature for the item, and the set of all visual features.
The word list in the review of a user on an item, the one-hot format of a word, and the set of all reviews.
The visual attention weights of a user on an item’s product image.
The weighting and bias parameters for visual attention weights.
The hidden states in GRU.
The parameter matrices for the input word embedding of GRU.
The parameter matrices for the hidden state of GRU.
The parameter matrices for the visual features in our revised GRU.
The bias vectors of GRU.
Context vector for generating a given word in the review of a user on an item.
The weighting and bias parameters used to derive context vectors.
The word embedding matrix.
The vocabulary size.
The word embedding dimension.
The hidden state dimension.
The sigmoid, hyperbolic tangent and ReLU activation functions.

Table 1. Notations and descriptions.

4.1.2. Visually Explainable Collaborative Filtering (VECF)

The overall design principle of our base model is shown in Figure 2(a). Each user and each item is associated with an embedding, and the visual feature of an item consists of one fixed-dimensional vector per spatial region of the item’s image. We compute the global feature of the item’s image by attentively merging these regional vectors, where the attention weight of each region is jointly determined by the current user and the corresponding regional feature. The attention weights are computed from learnable weighting and bias parameters, with the Rectified Linear Unit (ReLU) (Maas et al., 2013) as the activation function.

Given the item’s global image feature, the final item embedding is computed by a function that merges the image representation with the item’s latent embedding. Here, we implement it as a simple element-wise multiplication; other choices, including element-wise addition and vector concatenation, have also been tested, but they lead to worse performance.

When making predictions, we feed the user embedding and the final item embedding into a prediction function, which can be an arbitrary function or even a prediction neural network as in (He et al., 2017). Here, we choose the sigmoid inner product as a specific implementation, because it gives us better training efficiency on our large-scale data. However, the model is not restricted to this function, and many others can be used in practice according to the application domain.
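The forward pass of the base model described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact formulation: the attention score is assumed to be a one-hidden-layer ReLU network over the user embedding and the regional feature, normalized by a softmax over regions, and a projection matrix `W_img` is assumed so that the merged image feature has the same dimension as the item embedding before the element-wise multiplication. All parameter names (`W_u`, `W_f`, `b`, `w_out`, `W_img`) are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(user, regions, W_u, W_f, b, w_out):
    """One weight per region, determined jointly by the user embedding
    and the regional feature (assumed ReLU hidden layer + softmax)."""
    user_part = matvec(W_u, user)
    scores = []
    for f in regions:
        region_part = matvec(W_f, f)
        hidden = [max(0.0, u + r + c)
                  for u, r, c in zip(user_part, region_part, b)]
        scores.append(sum(w * h for w, h in zip(w_out, hidden)))
    return softmax(scores)


def predict(user, item, regions, params):
    """Return (predicted score, attention weights) for one user-item pair."""
    alpha = attention_weights(user, regions, params["W_u"], params["W_f"],
                              params["b"], params["w_out"])
    dim = len(regions[0])
    # Attention-weighted merge of the regional features.
    merged = [sum(a * f[d] for a, f in zip(alpha, regions)) for d in range(dim)]
    # Assumed projection to the embedding dimension.
    image_repr = matvec(params["W_img"], merged)
    # Element-wise multiplication with the item's latent embedding.
    final_item = [x * q for x, q in zip(image_repr, item)]
    # Sigmoid of the inner product with the user embedding.
    score = sum(p * q for p, q in zip(user, final_item))
    return 1.0 / (1.0 + math.exp(-score)), alpha
```

The returned attention weights are the quantities that would be used to pick the image regions to highlight as a visual explanation.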

Finally, we adopt binary cross-entropy as the loss function for model optimization. The objective function to be maximized is defined over the model parameter set and the ground-truth labels, where a label is 1 if the user purchased the item and 0 otherwise. For each user, it covers the items that the user has purchased before together with sampled negative instances drawn from the set of all items. Corresponding to each positive instance, we uniformly sample an instance from the unobserved interactions (i.e., unpurchased items of the user) as the negative instance. It should be noted that a nonuniform sampling strategy might further improve the performance, and we leave this exploration as future work.

In this objective, we maximize the likelihood of our predicted results with the first two terms, and regularize all of the model parameters to avoid overfitting with the last term. In the training phase, we learn the parameters based on stochastic gradient descent (SGD) optimization. Once the model is optimized, we are not only able to generate a personalized recommendation list for a target user according to the predicted scores, but can also highlight particular regions of the corresponding product image as visual explanations according to the attention weights, which will be explained in the following parts of the paper.
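The training objective and the uniform negative sampling described above can be sketched as follows. This is an illustrative sketch under assumptions: the objective is taken to be the log-likelihood of the binary labels minus an L2 penalty, the function names `sample_negatives` and `objective` and the weight `lam` are hypothetical, and `predict` stands for any function returning a probability in (0, 1).

```python
import math
import random


def sample_negatives(purchased, all_items, rng):
    """Uniformly draw one unpurchased item per purchased item.
    Assumes the user has at least one unpurchased item."""
    candidates = [j for j in all_items if j not in purchased]
    return [rng.choice(candidates) for _ in purchased]


def objective(predict, user, purchased, negatives, params, lam):
    """Log-likelihood of the labels minus an L2 penalty (to be maximized)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for j in purchased:            # label 1: log of the predicted score
        total += math.log(predict(user, j) + eps)
    for j in negatives:            # label 0: log of (1 - predicted score)
        total += math.log(1.0 - predict(user, j) + eps)
    penalty = lam * sum(p * p for p in params)
    return total - penalty
```

In practice the objective would be maximized with stochastic gradient steps over mini-batches; the sketch only evaluates it for a single user.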

4.2. The Review-enhanced Model

We have introduced the basic model for visually explainable recommendation, which relies only on images and implicit feedback. In many practical systems such as e-commerce, users usually express their opinions in the form of textual reviews. Compared with pure implicit feedback, the textual review signals can be very helpful in our task for two reasons. (1) They provide explicit information that reveals user preferences. In the example in Figure 3, although users A and B bought the same top (i.e., both have positive implicit feedback on the top), the features that they care about are very different according to the posted reviews. User A cares more about the fit and the neck opening, while user B is more interested in the quality and the pocket. Therefore, incorporating user reviews in the modeling process can help us to capture more comprehensive user preferences, which may lead to improved recommendation performance. (2) People may directly express their opinions on the visual features through their reviews. In the above example, user A expressed her opinion on the neck of the top in the image with “… Nice wide neck opening, very stylish looking …”. As a result, the textual reviews may serve as important signals to identify user preferences in the product image, which may help us to highlight more accurate visual regions tailored to different users, and further generate better visually explainable recommendations.

Motivated by these intuitions, we introduce user reviews as a weak supervision signal into our base model, and propose review-enhanced visually explainable collaborative filtering (Re-VECF for short) to further enhance the performance as well as the interpretability of the recommendations.

Figure 3. An example of user reviews on Amazon. The pink italic fonts reveal user preferences that can be aligned with the corresponding visual features on the product image.

4.2.1. Textual feature modeling based on gated recurrent unit (GRU) network

The review of a user on an item is a sequence of words, one per time step up to the length of the review. To model such textual features, we make use of recurrent neural networks (RNN) (Mikolov et al., 2010), which have been successfully applied to a number of language modeling tasks such as machine translation (Bahdanau et al., 2014), image captioning (Vinyals et al., 2015), and video classification (Yue-Hei Ng et al., 2015). Specifically, we adopt the gated recurrent unit (GRU) network (Cho et al., 2014) because we find it more computationally efficient in our task than other RNN variations, such as the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997).

In a standard GRU network, the prediction of the current word is conditioned on the previous hidden states as well as the previously generated words. The computation at each time step involves a reset gate and an update gate. The input word, given in one-hot format over the vocabulary, is mapped through the word embedding matrix, and the hidden state is updated using parameter matrices for the input word embedding and for the previous hidden state, together with bias vectors. The gates are applied through element-wise multiplication, and the sigmoid and hyperbolic tangent functions serve as the activation functions.

4.2.2. Review-enhanced visually explainable collaborative filtering (Re-VECF)

The architecture of our model is shown in Figure 2(b). To provide available signals to discover visual preferences, the prediction of each word is not only influenced by the previous word and the hidden state as in standard GRU, but also linked with the image regional features. More formally, the merged regional feature is added into the reset and update gates, through a parameter matrix for the visual features, to influence the review generation process. With the help of such a design, the user preference information embedded in the textual features can be leveraged to guide the learning of visual attentions through back propagation signals, denoted as the red dotted line in Figure 2(b).
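One GRU step in which the merged regional feature enters the reset and update gates can be sketched as follows. This is a minimal illustration, assuming the common GRU convention in which the new hidden state is `(1 - z) * h_prev + z * h_candidate`; the parameter names in the dictionary `P` (`W_*` for the input word embedding, `U_*` for the hidden state, `V_*` for the visual feature, `b_*` for the biases) are hypothetical, and `x` is taken to be the already looked-up word embedding. Dropping the `V_*` terms recovers a standard GRU step.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, img, P):
    """One GRU step; the image feature img enters only the reset and
    update gates, not the candidate hidden state."""
    def layer(name, act, h_in, use_img):
        pre = [a + b + c for a, b, c in zip(matvec(P["W_" + name], x),
                                            matvec(P["U_" + name], h_in),
                                            P["b_" + name])]
        if use_img:
            pre = [p + v for p, v in zip(pre, matvec(P["V_" + name], img))]
        return [act(p) for p in pre]

    r = layer("r", sigmoid, h_prev, True)       # reset gate
    z = layer("z", sigmoid, h_prev, True)       # update gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = layer("h", math.tanh, reset_h, False)
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_cand)]
```

Because the attention-merged image feature appears inside the gates, gradients of the review likelihood also reach the attention weights that produced it.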

For simplicity, we abbreviate the computations from Eq.(10) to (13) as a single recurrent update. The word at each time step is then predicted by a softmax operation over the vocabulary, conditioned on the set of all previous words. Note that at the first time step, the sequence model has no input information, and we thus use only the image feature to derive the initial state.

In real-world scenarios, textual reviews may be associated with user preferences or item features that are not reflected in the product image. Returning to the example in Figure 3, in the review of user B “… Nice quality …”, the feature quality can hardly be expressed in an image. Inspired by this intuition and to make our model more robust, we include user/item embeddings into the word prediction process to capture image-independent factors. Formally, we introduce a gate function to model, in a soft manner, whether the current word is generated from the image features or from the user/item embeddings, with one form of the computation for the initial state and another for the subsequent states. In these computations, a context vector is used to influence the review generation process, a sigmoid gate weighs the image and user/item embeddings, the Rectified Linear Unit (ReLU) (Maas et al., 2013) is the activation function, and the associated weighting and bias terms are model parameters. The gate is initialized so that the image embedding and the user/item embeddings are equally important.

Finally, our objective function to be maximized simultaneously covers the prediction of user implicit feedback and of textual reviews, balanced by a tuning parameter. In this way, we formulate our problem as a multi-task learning framework. By jointly capturing the preferences from user implicit feedback and textual reviews, we aim to achieve both recommendation accuracy and reasonable, high-quality visual explanations.

5. Experiments

In this section, we evaluate our proposed models focusing on the following three key research questions:

RQ 1: Performance of our models for Top-N recommendation.
RQ 2: Performance of our models for review prediction.
RQ 3: Performance of our models for providing visual explanations.

As mentioned before, our final model Re-VECF can be seen as a multi-task learning framework. The first two questions are designed to evaluate each subtask one by one, while the third question aims to study the visual explanations provided by our models. We begin by introducing the experimental setup, and then report and analyze the experimental results to answer these research questions.

5.1. Experimental Setup

Datasets. We conduct our experiments on the Amazon e-commerce dataset (He and McAuley, 2016a; McAuley et al., 2015). This dataset contains user-product purchasing behaviors as well as product images and textual reviews from Amazon spanning May 1996 - July 2014. We evaluate our models on the categories of Clothing, Shoes and Jewelry/Men and Clothing, Shoes and Jewelry/Women, with the statistics shown in Table 2.

Datasets #Users #Items #Interactions Density #Words
Men 643 2454 6359 0.403% 21600
Women 570 3346 7640 0.401% 17614
Table 2. Statistics of the datasets.

Evaluation methods. We use the following measures for evaluation on different tasks:

Precision (P), Recall (R) and F1-score (F1): These measures aim to evaluate the quality of the recommendations (Karypis, 2001). In the context of recommender systems, Precision computes the percentage of correctly recommended items in a user’s recommendation list, averaged across all testing users. Recall computes the percentage of purchased items that are actually recommended in the list, and it is also averaged across all testing users. By considering both Precision and Recall, the F1-score computes the harmonic mean of the two, which is reported in our experiments as the final result.

Hit-Ratio (HR): Hit-ratio gives the percentage of users that can receive at least one correct recommendation, which has been widely used in previous work (Karypis, 2001; Xiang et al., 2010).

NDCG: To assess if the items that a user has actually consumed are ranked in higher positions in the recommendation list, we use normalized discounted cumulative gain (NDCG) to evaluate ranking performance by taking the positions of the correct items into consideration (Järvelin and Kekäläinen, 2000).

ROUGE: The ROUGE score (Lin, 2004) is a widely used metric for evaluating the quality of text generation. It computes the overlap of n-grams between the generated text and the ground truth. Taking 2-grams as an example, the predicted review and the true review are first mapped into their respective 2-gram sets. Then the Precision (ROUGE-2-P), Recall (ROUGE-2-R) and F1-score (ROUGE-2-F1) are computed from the overlap between the two sets: ROUGE-2-P is the fraction of predicted 2-grams that appear in the true review, ROUGE-2-R is the fraction of true 2-grams that appear in the predicted review, and ROUGE-2-F1 is their harmonic mean.

In this work, we report ROUGE score under 1-gram and 2-gram settings, referred to as ROUGE-1 and ROUGE-2, respectively.
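The evaluation measures above can be sketched per user (or per review) as follows; the reported numbers would be averages of these values over all testing users. This is an illustrative sketch, assuming binary relevance for NDCG and set-based n-gram overlap for ROUGE; the function names are hypothetical.

```python
import math


def harmonic_mean(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 of one recommendation list."""
    hits = len(set(recommended) & set(relevant))
    p = hits / len(recommended) if recommended else 0.0
    r = hits / len(relevant) if relevant else 0.0
    return p, r, harmonic_mean(p, r)


def hit(recommended, relevant):
    """1.0 if at least one recommended item is correct, else 0.0."""
    return 1.0 if set(recommended) & set(relevant) else 0.0


def ndcg(recommended, relevant):
    """Binary-relevance NDCG: correct items ranked higher score more."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended) if item in relevant)
    ideal_hits = min(len(relevant), len(recommended))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / ideal if ideal > 0 else 0.0


def rouge_n(predicted, truth, n=2):
    """Set-based ROUGE-n precision, recall and F1 for two token lists."""
    def ngram_set(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    s_pred, s_true = ngram_set(predicted), ngram_set(truth)
    overlap = len(s_pred & s_true)
    p = overlap / len(s_pred) if s_pred else 0.0
    r = overlap / len(s_true) if s_true else 0.0
    return p, r, harmonic_mean(p, r)
```

For example, `rouge_n("nice wide neck opening".split(), "nice wide neck".split())` compares three predicted 2-grams with two true 2-grams, both of which are matched.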

Model      Reference                       Information   Depth
Top-N recommendation (Measures: F1, HR and NDCG)
BPR        (Rendle et al., 2009)           -             shallow model
VBPR       (He and McAuley, 2016b)         image         shallow model
HFT        (McAuley and Leskovec, 2013)    text          shallow model
NRT        (Li et al., 2017)               text          deep model
JRL        (Zhang et al., 2017)            image+text    deep model
VECF       section 4.1                     image         deep model
Re-CF      section 4.2                     text          deep model
Re-VECF    section 4.2                     image+text    deep model
Review prediction (Measures: ROUGE)
NRT        (Li et al., 2017)               text          deep model
Re-CF      section 4.2                     text          deep model
Re-VECF    section 4.2                     image+text    deep model
Visual explanation (Measures: F1 and NDCG)
VECF       section 4.1                     image         deep model
Re-VECF    section 4.2                     image+text    deep model
Table 3. Summary of the models in our experiments on the three tasks, comparing the specific information used in each model and the depth of the models.

Baselines. We adopt the following representative and state-of-the-art methods as baselines for performance comparison:

BPR: The Bayesian personalized ranking (Rendle et al., 2009) model is a popular method for top-N recommendation. We adopt matrix factorization as the prediction component for BPR.

VBPR: The visual Bayesian personalized ranking (He and McAuley, 2016b) model is a state-of-the-art method for recommendation leveraging product images.

HFT: The hidden factors and topics model (McAuley and Leskovec, 2013) is a well known recommendation method leveraging user textual reviews.

NRT: The neural rating regression model (Li et al., 2017) is a state-of-the-art neural recommender which can generate user textual reviews.

JRL: The joint representation learning model (Zhang et al., 2017) is a state-of-the-art neural recommender, which can leverage multi-modal side information for Top-N recommendation.

VECF: This is the base model proposed in section 4.1. It integrates the visual attention mechanism with collaborative filtering under the supervision of user implicit feedback.

Re-CF: This is a variation of our final model in section 4.2. We remove the image feature from the input, and derive an image-free collaborative filtering model supervised by the user implicit feedback and textual reviews.

And our final model for visually explainable recommendation is denoted as Re-VECF. For easy understanding, we summarize the similarities and differences of all the models in our experiments on different tasks in Table 3.

Parameter settings. We initialize all the optimization parameters according to a uniform distribution, and update them by stochastic gradient descent (SGD). The learning rate and the tuning parameter (in Eq.(23)) are each selected from a range of candidate values, and the dimension of the user/item embedding is tuned in the same way. For each dataset, we first tokenize all the user reviews with the Stanford CoreNLP tool, and then build the lexicon by retaining all the tokens for GRU training. The word embeddings are pre-trained based on the Skip-gram model, and the embedding size is set as 64. For the baselines, we determine the optimal parameter settings by grid search, and the models designed for rating prediction (i.e. HFT and NRT) are learned by optimizing the pairwise ranking loss of BPR to model user implicit feedback. When conducting experiments, 70% of the items of each user are used for training, while the remaining items are used for testing. We generate a top-5 recommendation list (N = 5) for each user in the test dataset.
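The per-user 70%/30% split can be sketched as follows. This is an illustrative sketch, assuming the items of each user are shuffled randomly before being cut (the text does not state how the 70% are chosen); the function name `split_user_items` is hypothetical.

```python
import random


def split_user_items(items, train_ratio=0.7, rng=None):
    """Split one user's purchased items into (train, test) lists."""
    rng = rng or random.Random()
    shuffled = list(items)
    rng.shuffle(shuffled)                       # assumed: random selection
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Applying the function to every user's purchase list, with a seeded `random.Random` for reproducibility, yields the training and testing interactions.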

5.2. Top-N Recommendation (RQ1)

In this section, we evaluate our models for the task of Top-N recommendation. Specifically, we first compare our models Re-CF, VECF and Re-VECF with the previously proposed methods (i.e. BPR, VBPR, HFT, NRT, and JRL), and then we study the influence of embedding dimension on the recommendation results.

Model comparison. Table 4 shows the performance of different methods on F1, HR and NDCG. We can make the following observations.

VBPR, HFT, NRT, Re-CF and VECF achieve better performance than BPR. Considering that the key difference between BPR and VBPR/HFT/NRT/Re-CF/VECF is that the latter models integrate either visual or textual features into their modeling process, this observation verifies that side information, such as user reviews or product images, can help to improve the performance in real-world systems. Furthermore, by incorporating both visual and textual features together, JRL obtains the best performance among the baselines.

NRT and Re-CF can achieve better performance than HFT on both Men and Women datasets. This is within expectation because 1) the multiple non-linear layers in NRT can be more expressive in terms of user preference modeling compared with the inner product operation in HFT, and 2) NRT and Re-CF can better capture word sequential information than HFT in review sentences. This allows NRT and Re-CF to achieve better precision in textual feature and user profile modeling, which further improves the recommendation performance.

Dataset          Men                       Women
Measure@5 (%)    F1      HR      NDCG      F1      HR      NDCG
BPR              1.209   3.901   0.740     0.897   3.342   0.611
HFT              1.242   4.243   0.757     0.915   3.371   0.631
VBPR             1.361   4.261   0.773     0.929   3.402   0.648
NRT              1.399   4.469   0.802     0.952   3.527   0.674
Re-CF            1.370   4.364   0.781     0.937   3.451   0.651
VECF             1.378   4.373   0.791     0.948   3.523   0.669
Re-VECF          1.442   4.803   0.846     0.985   3.587   0.712
Table 4. Summary of the performance for baselines and our models. The first block shows the baseline performances, where starred numbers are the best baseline results; the second block shows the results of Re-VECF and its variations. Bolded numbers are the best performance of each column, and all numbers in the table are percentage numbers with ‘%’ omitted. Improvements of our final model Re-VECF over the best baseline are statistically significant under a paired t-test.

VECF performs better than Re-CF, and VBPR performs better than HFT. This observation highlights the importance of visual features in personalized recommendation. It is consistent with the intuition that customers are largely influenced by product images when making purchasing decisions online, so images contain rich information about users’ personalized preferences.

Re-CF and VECF fail to surpass JRL. This is not surprising, because JRL uses both review and image features for user/item profiling, while Re-CF and VECF each use only one of the information sources. Encouragingly, our final Re-VECF model achieves better performance than JRL, which indicates the effectiveness of our method for the Top-N recommendation task. The main reason may be that, when profiling visual features, JRL represents the whole product image with a single fixed vector, whereas the attention mechanism in our model can focus selectively on the image regions that are more important to the corresponding user, which helps to better capture user preferences and improve recommendation performance.
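The contrast between a fixed image vector and user-aware attention can be sketched as follows. This is a generic illustration and not the model's Eq.(2): the dot-product scoring function, the function names and the toy two-dimensional features are all assumptions. It only shows that attention pooling yields a different image representation for each user, whereas mean pooling yields the same vector for everyone.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(user_vec, region_feats):
    """Weight each region by softmax(user . region) and return (weights, pooled vector)."""
    scores = [sum(u * f for u, f in zip(user_vec, feat)) for feat in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, region_feats))
              for d in range(dim)]
    return weights, pooled

def mean_pool(region_feats):
    """User-independent baseline: average all region features."""
    n = len(region_feats)
    return [sum(col) / n for col in zip(*region_feats)]

# Hypothetical toy data: three image regions, two users with different tastes.
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights_a, _ = attentive_pool([1.0, 0.0], regions)
weights_b, _ = attentive_pool([0.0, 1.0], regions)
print(weights_a.index(max(weights_a)), weights_b.index(max(weights_b)))
```

The region weights produced this way are also what a visual explanation would highlight, since they indicate which regions contribute most to a given user's representation of the item.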

Figure 4. Performance of our models and baselines under different choices of the embedding dimension.

Influence of the embedding dimension. In this section, we study how the embedding dimension influences the model performance. We set all other parameters according to section 5.1 and tune the embedding dimension from 10 to 100 (even larger values decrease the performance). From the top-5 results shown in Figure 4, we can see that on the Men dataset, all the models reach their best performance with relatively small dimensions, while additional parameters do not improve the performance further. Similar results can be observed on the Women dataset. This observation suggests that, although more latent factors increase expressive power, too many of them also increase model complexity and may lead to over-fitting, which weakens the generalization ability of our models on the test dataset.

5.3. Review Prediction (Rq2)

In this section, we evaluate the second subtask of our model – review generation – by comparing the predicted reviews with the truly posted ones. Specifically, we first conduct quantitative evaluation on the models that can generate user reviews, and then we present intuitive analysis on the predicted reviews in a qualitative manner.

Quantitative evaluation. To begin with, we compare our final model Re-VECF with Re-CF and NRT, where Re-CF is the image-free version of Re-VECF. The model parameters follow the settings in section 5.1, and we conduct this experiment on the Men dataset. From the results on ROUGE-1 and ROUGE-2 shown in Table 5, we can see that Re-VECF achieves significantly better performance than Re-CF and NRT on all the metrics. This is expected because the product aspects that users comment on in textual reviews may be directly aligned with the product images; including visual features in the modeling process can therefore capture such signals and lead to more accurate predictions.
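For readers unfamiliar with ROUGE-N, the following simplified sketch computes n-gram precision, recall and F1 between a generated review and the true review. It assumes lowercased whitespace tokenization with no stemming or stop-word handling, so its numbers will differ from those of the official ROUGE package; the function names are illustrative. The example pair is the belt review from Table 6.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(count) times
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

true_review = "Very good-looking sturdy belt with a good ribbed weave and strong buckle"
generated = "I like this good looking buckle"
print(rouge_n(generated, true_review, n=1))
print(rouge_n(generated, true_review, n=2))
```

Under this tokenization, only "good" and "buckle" match as unigrams (2 of 6 generated tokens, 2 of 12 reference tokens) and no bigram matches, which illustrates why ROUGE-2 scores are typically much lower than ROUGE-1 scores.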

Qualitative analysis. To provide more intuition, we also list several examples of the generated user reviews in Table 6. We can see that with the help of gated recurrent units (GRU) for natural language generation, the linguistic quality of the reviews generated by all the models is reasonably good. More encouragingly, Re-VECF can generate words that directly describe (part of) the product image, and some of these words are also mentioned in the true review. In the boxed areas of the images, for example, Re-VECF generates very explicit descriptive words such as sleeve and buckle by automatically aligning the information between image and text, while Re-CF and NRT only output very general-purpose expressions. This observation is in line with the quantitative results above, and it further verifies that by including visual features in the modeling process, Re-VECF is able to learn the relationships between visual and textual features, and thus to generate descriptive textual expressions for the recommendations and their images.

Method ROUGE-1 ROUGE-2
P(%) R(%) P(%) R(%)
Re-CF 16.15 37.98 19.11 1.11 3.73 1.61
NRT 18.67 41.28 21.77 1.42 4.12 2.01
Re-VECF 22.01 48.36 27.49 1.68 4.78 2.32
Table 5. Performance comparison between Re-CF, NRT and Re-VECF on the task of review prediction. Improvements of Re-VECF over the baselines are statistically significant.
Image | True Review | Re-VECF | Re-CF | NRT
[image] | It’s an excellent poplin solid color long sleeved shirt | Much like the sleeve | Not bad for the price | Very good choice
[image] | Very good-looking sturdy belt with a good ribbed weave and strong buckle | I like this good looking buckle | Great for the price | Makes a great price
Table 6. Examples of the generated reviews compared with the true reviews. The bolded italic words (e.g., sleeve) mean that the word generated by Re-VECF was also mentioned in the true review, and the word is aligned to the boxed area of the image learned by the attention mechanism in our model.
# | Target Item | Historical Records | Textual Review | Visual Explanation
1 | [image] | [images] | this is a large watch… nearly as large as my suunto but due to its articulated strap it fits on the wrist very well. | [images]
2 | [image] | [images] | this is a really comfortable v-neck. i found that the size and location of the v are just right for me. i’m 5’8", but 200 lbs ( and dropping :) ) | [images]
3 | [image] | [images] | Great leggings. perfect for fly fishing or hunting or running. just perfect anytime you are cold! | [images]
4 | [image] | [images] | The socks on the shoes are a perfect fit for me. first time with a shoe with the speed laces and i like them a lot | [images]
5 | [image] | [images] | Really like these socks! they are really thick woolen socks and are good for cold days. they cover a good portion of your feet as they go a little (halfway) above the calf muscle area. | [images]
6 | [image] | [images] | I like the front pocket! Very cool! | [images]
Table 7. Examples of the visual explanations, where each row represents the target item of a user. The first column shows the image of the target item, and the second column lists the two products most similar to the target item that the user purchased before. The third column shows the user’s review of the target item, and in the last two columns, we compare the highlighted regions provided by VECF and Re-VECF for the target item. In the review column, we use bolded italic to highlight the part of the user review that our generated visual explanations correspond to.

5.4. Visual Explanation (Rq3)

In this section, we evaluate whether the visual explanations generated by our model are reasonable, i.e., whether the highlighted image regions learned by our model really reveal a user’s potential interests in the recommended item. As before, we first conduct a quantitative analysis, based on a dataset with collectively labeled ground truth. Then, to provide better intuition for the generated visual explanations, we present and analyze several examples learned by the model in a qualitative manner.

Quantitative evaluation. To the best of our knowledge, this work is the first one on visually explainable recommendation, and there is no publicly available dataset with labeled ground-truth to evaluate whether the visual explanations (i.e., the visual highlights) generated by our model are reasonable or not. To tackle the problem, we build a collectively labeled dataset with Amazon MTurk by asking the workers to identify the image regions that may explain why a user bought a particular item, based on the user’s previous purchase records and his/her review written on the target item.

More specifically, we still adopt the Amazon dataset, and retain the top-100 most active users (i.e., users with the most purchasing records) in the dataset. These users are provided to MTurk workers for labeling, so that the workers have sufficient historical information about a user to understand the user’s personalized interests when labeling an image for the user.

For each of the 100 users, we randomly select one item that the user purchased before as the target item to label, and the image of the target item is equally divided into 5 × 5 = 25 square regions. A labeling task for a worker is to identify the 5 out of the 25 regions that the worker believes are most relevant to the user’s preference. For each labeling task, we provide the following two information sources to the worker for reference:

  • Images and the corresponding reviews of the products that the user previously purchased.

  • The user’s review on the target item to be labeled.

In a label task, a worker is first required to read the image-review pairs of the user’s previously purchased products (around 10 pairs). After that, the worker will be shown the target image as well as the corresponding review, and be asked to identify 5 regions of the image. In this way, the worker can understand the user’s personalized preference through the user reviews, and then identify the relevant image regions based on both the user preference and the user’s review on the target item.

#Users   #Items   #identified regions   #regions/image
94       94       220                   2.34
Table 8. Basic statistics of the labeled dataset.
Method Top-5 Top-10
Random 3.22 8.24 7.41 11.46
VECF 6.70 17.37 10.38 16.40
Re-VECF 8.35 20.53 12.99 19.95
Table 9. Performance comparison between VECF and Re-VECF on the visual explanation task of identifying the top-5 and top-10 relevant regions out of 196 candidate regions. Improvements of Re-VECF over VECF are statistically significant.

Finally, each target item is labeled by two workers, and we retain only the regions identified by both workers as the ground truth; thus, the final number of regions for an image may be less than 5. Some basic statistics of the labeled dataset are shown in Table 8. Note that the final numbers of users and items are less than 100 because there are 6 target items for which the two workers identified no common region.
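The agreement rule described above can be sketched in a few lines. The item identifiers and region indices (0 to 24 on the 5 × 5 grid) below are hypothetical, and the function name is an assumption made for illustration.

```python
def merge_labels(worker_a, worker_b):
    """Keep, per target item, only the regions chosen by both workers.

    Items for which the two workers share no region are dropped entirely.
    """
    ground_truth = {}
    for item in worker_a.keys() & worker_b.keys():
        common = set(worker_a[item]) & set(worker_b[item])
        if common:
            ground_truth[item] = common
    return ground_truth

# Hypothetical labels: each worker picks 5 of the 25 regions per target item.
labels_a = {"item1": [0, 1, 2, 6, 7], "item2": [3, 4, 8, 9, 14]}
labels_b = {"item1": [1, 2, 3, 7, 12], "item2": [20, 21, 22, 23, 24]}

truth = merge_labels(labels_a, labels_b)
avg_regions = sum(len(r) for r in truth.values()) / len(truth)
print(truth, avg_regions)
```

In this toy example, "item2" is dropped because the workers share no region, mirroring the 6 target items removed from the real dataset. The same ratio computed from Table 8 is 220 / 94 ≈ 2.34 regions per image.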

For evaluation, we compare VECF and Re-VECF, as no other models can provide visual explanations, and the model parameters follow the settings in section 5.1. Because both VECF and Re-VECF work on image features, we use each model to identify the top-5 and top-10 regions out of the 196 candidate regions according to the learned attention weights, and a region identified by the algorithm is considered correct if it falls into the human-labeled regions. The results of comparing the predicted regions against the ground truth are shown in Table 9.
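One possible way to implement this matching rule is sketched below. It assumes that the 196 candidate regions form a 14 × 14 grid, that a candidate region "falls into" a labeled region when its center lies inside one of the 5 × 5 labeled squares, and that precision and recall are the quantities of interest; none of these details is specified in this section, so the sketch is illustrative only.

```python
def coarse_region(cell_index, fine=14, coarse=5):
    """Map a fine-grid cell to the index of the coarse square containing its center."""
    row, col = divmod(cell_index, fine)
    coarse_row = int((row + 0.5) / fine * coarse)
    coarse_col = int((col + 0.5) / fine * coarse)
    return coarse_row * coarse + coarse_col

def topk_precision_recall(attention, labeled_regions, k=5):
    """Score the k most-attended cells against a set of human-labeled coarse regions."""
    top_cells = sorted(range(len(attention)),
                       key=lambda i: attention[i], reverse=True)[:k]
    hits = [c for c in top_cells if coarse_region(c) in labeled_regions]
    covered = {coarse_region(c) for c in hits}
    precision = len(hits) / k
    recall = len(covered) / len(labeled_regions) if labeled_regions else 0.0
    return precision, recall

# Hypothetical attention weights over 196 cells and two labeled regions.
attention = [0.0] * 196
for cell, weight in [(0, 0.5), (1, 0.4), (195, 0.3), (97, 0.2), (98, 0.1)]:
    attention[cell] = weight
print(topk_precision_recall(attention, {0, 14}, k=5))
```

Here three of the five selected cells land in labeled squares, and both labeled squares are covered by at least one selected cell.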

It should be noted that selecting the top-5 and top-10 regions out of 196 candidates is itself a difficult ranking problem, as shown by the poor performance of random selection. By automatic attentive learning over the images, the VECF model gains significant improvements. By further introducing user reviews as a weak supervision signal, our final Re-VECF model generates much more accurate visual explanations, which verifies that the review information plays an important role in aligning the textual and visual features to generate visual explanations.

Qualitative analysis. Explainability of recommendations is often assessed qualitatively (Wu et al., 2016; Ren et al., 2017; Wang and Blei, 2011; He et al., 2015). To provide a more intuitive analysis, we also evaluate our generated visual explanations in a similar manner. To compare VECF and Re-VECF, we present the visual explanations they generate for the same product in the testing dataset, with the parameters following the default settings described in section 5.1. The highlighted regions of a product image are determined by the learned weights in Eq.(2). Examples are presented in Table 7. From the results we have the following observations.

Our models can provide meaningful explanations. In Case 6, for example, the pocket of the shirt was highlighted by Re-VECF, and in Case 4, VECF labeled the toe of the shoe. These fashion elements are in accordance with the products that the user has purchased.

In Case 2, although the T-shirts in the historical records differ in many aspects such as color and style, Re-VECF successfully captured their essential similarity, the v-neck, which is highlighted as the visual explanation. This shows the capability of our model to discover users’ visual preferences from images and reviews.

In Cases 2 and 6, Re-VECF highlighted different components (collar and pocket) of similar items (shirts) for different users. This shows that our visual explanations are personalized, which verifies the effectiveness of the user-aware attention mechanism in Eq.(2).

By comparing the highlighted regions with the users’ reviews, we see that Re-VECF tends to highlight more accurate image regions than VECF. In Case 2, for example, the user praised the collar of the shirt with “… this is a really comfortable v-neck …”, and Re-VECF successfully labeled the neck region as the visual explanation, while VECF highlighted the sleeve of the shirt. Other cases also suggest the superiority of Re-VECF over VECF in terms of visual explanation.

These observations further verify that the review information leveraged in Re-VECF provides very informative user preferences to better supervise the learning of visual attention, and thus generates more accurate visual explanations.
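As an illustration of how learned region weights can be turned into the highlighted boxes discussed above, the sketch below selects the most-attended cells and converts them to pixel coordinates. It assumes a 14 × 14 grid (matching the 196 candidate regions) over a 224 × 224 pixel image; the image size, the function name and the toy weights are assumptions, not details taken from the model description.

```python
def highlight_boxes(attention, k=5, grid=14, image_size=224):
    """Return (left, top, right, bottom) pixel boxes for the k most-attended cells."""
    cell = image_size // grid  # side length of one grid cell in pixels
    top_cells = sorted(range(len(attention)),
                       key=lambda i: attention[i], reverse=True)[:k]
    boxes = []
    for idx in top_cells:
        row, col = divmod(idx, grid)
        boxes.append((col * cell, row * cell, (col + 1) * cell, (row + 1) * cell))
    return boxes

# Hypothetical weights: most attention on the cell in row 1, column 1.
weights = [0.0] * 196
weights[15] = 1.0
print(highlight_boxes(weights, k=1))
```

Drawing these boxes on the product image gives a simple visual explanation; a smoother heatmap could be obtained by upsampling the full weight grid instead of taking only the top cells.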

6. Conclusions

In this paper, we propose visually explainable recommendation, aiming to make recommender systems explainable from the visual perspective. To achieve this goal, we propose two attentive architectures, supervised by user implicit feedback as well as textual reviews, to capture users’ visual preferences. Extensive experiments verify that our models not only provide accurate recommendations and review predictions, but also provide reasonable visual explanations for the recommended items.

This is a first step towards our goal of visually explainable recommendation, and there is much room for further improvement. For example, we can integrate probabilistic graphical models with neural modeling to introduce different empirical prior distributions for more accurate visual preference discovery. It will also be interesting to leverage eye-tracking devices to align users’ eye attention with the model-learned attention on visual images for visually explainable recommendation. Beyond e-commerce, we will also investigate visually explainable recommendation in other image-related recommendation scenarios, such as social image recommendation on Instagram or Pinterest, or even multimedia recommendation for streaming videos.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. JMLR (2003).
  • Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In SIGIR.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).
  • Cui et al. (2016) Qiang Cui, Shu Wu, Qiang Liu, and Liang Wang. 2016. A Visual and Textual Recurrent Neural Network for Sequential Prediction. arXiv preprint arXiv:1611.06668 (2016).
  • Geng et al. (2015) Xue Geng, Hanwang Zhang, Jingwen Bian, and Tat-Seng Chua. 2015. Learning image and user features for recommendation in social networks. In ICCV.
  • He and McAuley (2016a) Ruining He and Julian McAuley. 2016a. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW.
  • He and McAuley (2016b) Ruining He and Julian McAuley. 2016b. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI.
  • He et al. (2015) Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. Trirank: Review-aware explainable recommendation by modeling aspects. In CIKM.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW.
  • Heckel et al. (2017) Reinhard Heckel, Michail Vlachos, Thomas Parnell, and Celestine Dunner. 2017. Interpretable recommendations via overlapping co-clusters. IEEE International Conference on Data Engineering (ICDE) (2017).
  • Herlocker et al. (2000) Jonathan L Herlocker, Joseph A Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In CSCW.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation (1997).
  • Järvelin and Kekäläinen (2000) Kalervo Järvelin and Jaana Kekäläinen. 2000. IR evaluation methods for retrieving highly relevant documents. In SIGIR.
  • Karypis (2001) George Karypis. 2001. Evaluation of item-based top-n recommendation algorithms. In CIKM.
  • Li et al. (2017) Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural Rating Regression with Abstractive Tips Generation for Recommendation. In SIGIR.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL workshop.
  • Liu et al. (2017) Qiang Liu, Shu Wu, and Liang Wang. 2017. DeepStyle: Learning User Preferences for Visual Recommendation. In SIGIR.
  • Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML.
  • McAuley and Leskovec (2013) Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Recsys.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In SIGIR.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech.
  • Ren et al. (2017) Zhaochun Ren, Shangsong Liang, Piji Li, Shuaiqiang Wang, and Maarten de Rijke. 2017. Social collaborative viewpoint regression with explainable recommendations. In WSDM.
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI.
  • Seo et al. (2017) Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable Convolutional Neural Networks with Dual Local and Global Attention for Review Rating Prediction. In Recsys.
  • Shankar et al. (2017) Devashish Shankar, Sujay Narumanchi, HA Ananya, Pramod Kompalli, and Krishnendu Chaudhury. 2017. Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce. arXiv preprint arXiv:1703.02344 (2017).
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. ICLR (2015).
  • Tintarev and Masthoff (2007) Nava Tintarev and Judith Masthoff. 2007. A survey of explanations in recommender systems. In Data Engineering Workshop.
  • Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.
  • Wang and Blei (2011) Chong Wang and David M Blei. 2011. Collaborative topic modeling for recommending scientific articles. In KDD.
  • Wang et al. (2017) Suhang Wang, Yilin Wang, Jiliang Tang, Kai Shu, Suhas Ranganath, and Huan Liu. 2017. What your images reveal: Exploiting visual contents for point-of-interest recommendation. In WWW.
  • Wu et al. (2016) Chao-Yuan Wu, Alex Beutel, Amr Ahmed, and Alexander J Smola. 2016. Explaining reviews and ratings with paco: Poisson additive co-clustering. In WWW.
  • Xiang et al. (2010) Liang Xiang, Quan Yuan, Shiwan Zhao, Li Chen, Xiatian Zhang, Qing Yang, and Jimeng Sun. 2010. Temporal recommendation on graphs via long-and short-term preference fusion. In SIGKDD.
  • Yue-Hei Ng et al. (2015) Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In CVPR.
  • Zhang (2017) Yongfeng Zhang. 2017. Explainable Recommendation: Theory and Applications. arXiv preprint arXiv:1708.06409 (2017).
  • Zhang et al. (2017) Y Zhang, Q Ai, X Chen, and W Croft. 2017. Joint representation learning for top-n recommendation with heterogeneous information sources. CIKM (2017).
  • Zhang et al. (2014) Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In