Do Co-purchases Reveal Preferences? Explainable Recommendation with Attribute Networks

08/16/2019 · by Guannan Liu, et al. · Beihang University

With the prosperity of business intelligence, recommender systems have evolved into a new stage in which we care not only about what to recommend, but also why it is recommended. Explainability of recommendations has thus emerged as a focal point of research and become highly desired in e-commerce. Existing studies along this line often exploit item attributes and item correlations from different perspectives, but they still lack an effective way to combine both types of information for deep learning of personalized interests. In light of this, we propose a novel graph structure, the attribute network, built from both the items' co-purchase network and their important attributes. A neural model called eRAN is then proposed to generate recommendations from attribute networks with explainability and cold-start capability. Specifically, eRAN first maps items connected in attribute networks to low-dimensional embedding vectors through a deep autoencoder, and then applies an attention mechanism to model the attraction of attributes to users, from which personalized item representations can be derived. Moreover, a pairwise ranking loss is built into eRAN to improve recommendations, under the assumption that, in a personalized view, item pairs co-purchased by a user should be more similar than negatively sampled non-paired items. Experiments on real-world datasets demonstrate the effectiveness of our method compared with state-of-the-art competitors. In particular, eRAN shows its unique abilities in recommending cold-start items with higher accuracy, as well as in understanding the user preferences underlying complicated co-purchasing behaviors.




1. Introduction

Recommender systems play indispensable roles in contemporary e-commerce by empowering consumers to reach their preferred products more efficiently (Adomavicius and Tuzhilin, 2005). Traditionally, recommendation methods rely on the similarity between items and recommend the most “similar” items in terms of users’ historical purchasing records. In particular, in item-based collaborative filtering (CF) (Sarwar et al., 2001; Kabbur et al., 2013), the similarity between items can arise from co-purchase relationships, i.e., two items are regarded as similar if they have been co-purchased by many users in history.

The underlying factors that drive the co-purchase of items, however, may not be uniform across users, and thus co-purchases do not necessarily reveal users’ genuine preferences. In reality, users may have their own desired feature aspects when considering buying an item. For example, two movies may be co-watched by users, but some users may only be driven by the same leading actor, while others may favor the same director. Inferring users’ underlying individual preferences from co-purchased items, however, has not been well addressed in prior studies. This is also closely related to the concept of explainable recommendation, which has attracted great research interest and become highly desired in e-commerce (Zhao et al., 2014; Zhang and Chen, 2018).

In this paper, we aim at integrating items’ co-purchase information with items’ attribute information for deep learning of users’ personalized interests. Based on the item correlations formed in co-purchases, we propose a novel item network structure called the attribute network. An attribute network is essentially a customized sub-network of the item co-purchase network, formed by keeping only the connections between items that share the same value of a given attribute, e.g., two movies with the same director. By decomposing a co-purchase network into various attribute networks, we can specify the diverse driving forces underlying the co-purchase network, which can then be used to characterize a user’s individual preference and explain which aspects of a recommended item he/she desires most, i.e., generate explainable recommendations. Items that have never been purchased, and thus cannot enter the co-purchase network, can still enter attribute networks as long as they have attribute information, which sheds light on cold-start recommendation.

We then propose a novel method for Explainable Recommendation based on Attribute Networks (eRAN). In eRAN, items in an attribute network are represented by low-dimensional embedding vectors through a deep autoencoder to account for the nonlinearity and higher-order proximity in the network. Meanwhile, with users mapped to embedding vectors, an attention mechanism is adopted in eRAN to model user preferences toward different attributes. Then, a personalized item representation can be constructed by taking a weighted average of the node embeddings from each attribute network. Under the assumption that a user’s co-purchased items should be more similar than others, eRAN optimizes a contrastive objective to derive personalized similarity between item pairs, which differs from traditional item-to-item methods that directly factorize item ratings through aggregate similarity (Kabbur et al., 2013). With the learned parameters, the recommendation score of each item can be obtained from the user embeddings and personalized item representations, and the most similar items in the lens of each individual user are recommended. Meanwhile, eRAN can provide fine-grained explanations of users’ desired attributes for a particular recommended item.

We conduct experiments on three real-world datasets covering movies, books, and music, where attributes directly influence users’ experiences with the items, to validate the effectiveness of the proposed methods. Experimental results demonstrate the superiority of the proposed methods over state-of-the-art recommendation methods in terms of accuracy. Also, thanks to the incorporated auxiliary item attributes, the method is capable of coping with cold-start items when their attributes are given, and we further conduct experiments to show that our method can better predict which users would be interested in such new items. Last but not least, we showcase how the attention weights and user embeddings inferred from the model can guide the explanations for recommendations.

2. Attribute Network

Figure 1. Toy example for attribute network.

Items can be inter-connected with each other from various perspectives. For example, different items can appear in the same user’s purchasing history, and the connections between such paired items can be driven by the user’s tastes and preferences. Also, connections can be established between items with similar attributes, since they may both satisfy users’ specific needs. Such relationships can indeed be utilized to recommend items to individual users in a personalized manner, with explanations on particular aspects.

One prominent strategy for constructing item relationships is to learn item similarity by factorizing a user’s rating for an item as an aggregation of its similarity with previously rated items, often referred to as neighborhood items (Kabbur et al., 2013). The relationships derived from item similarity can provide possible explanations for recommendations, e.g., “item A is recommended because it is similar to the previously bought item B”. Such relationships, however, remain the same across users, i.e., every user is assumed to regard the similarity between each pair of items identically, which may not always hold and prevents the relationships from distinguishing fine-grained user preferences.

For example, the items are connected differently with regard to the graph structure shown in Fig. 1, where A has more connections with C, D, and E, which have similar actors, while B interacts more with F and G due to their common directors. Two users X and Y may have watched both movies A and B, but the underlying reasons may differ: user X may only prefer the actors, while Y may be driven by the directors. Thus, the relationship between A and B should be decomposed in accordance with users’ preferences on particular aspects, i.e., the similarity between items should not be treated uniformly for every user. In particular, users such as X who favor movies with specific actors may be more likely to accept the recommended item C or D, rather than F or G, with explanations such as: “We recommend C because it has similar actors to the watched movie A”.

Obviously, item relationships can be decomposed by incorporating auxiliary information such as item attributes. With the attributes attached to each item, users’ personalized preferences toward particular relationships can be explicitly explained. If the items with similar attributes are connected as shown in Fig. 1, users’ preferences can propagate along the connections driven by particular attributes. For example, user X’s preferences toward actors can manifest in the local connections driven by actor-network, while Y’s favor for directors can be disclosed through the director-network.

In addition, traditional content-based methods generally treat item attributes independently when computing the similarity between items, which loses the higher-order relationships between them. Taking the items in Fig. 1 again, item A and item C share no common actors, so the similarity based on the actor attribute would be scored 0 in the traditional sense. However, we can observe that A has an actor in common with B, and B also has an actress in common with C, which can indicate proximity between A and C.

Therefore, in order to capture users’ preferences toward attributes and also account for the higher-order relationships that arise in a particular attribute space, we propose a novel network structure, namely the attribute network. We first construct a co-purchase network G = (V, E) from the rating/purchasing history, where each item is regarded as a node, and “also buy item j with item i” is termed as a link (i, j) between nodes i and j, (i, j) ∈ E. In particular, each item i has a K-dimensional attribute vector x_i, and for each type of attribute k, we only reserve the edges in E whose endpoint nodes share the same value of attribute k, with the subset of links being E_k ⊆ E, which gives an induced subgraph of G, i.e., the k-attribute network G_k.

We assume that the reason why items are co-purchased can be attributed to one or several attributes. This assumption is especially applicable to experience products such as movies and music, where users show stable personalized preferences toward the attributes and the attributes greatly influence their experience of the products. Given the attribute networks derived as induced subgraphs of the co-purchase network, users’ personalized preferences toward items can be decomposed into multi-attribute item relationships, and meanwhile explanations for recommendations can be derived in terms of both item-based and content-based views.
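To make the construction concrete, the following sketch (with hypothetical names; the paper does not prescribe an implementation) splits a co-purchase edge list into per-attribute networks by keeping an edge in the k-attribute network only when both endpoints share the value of attribute k:

```python
def build_attribute_networks(copurchase_edges, item_attrs, num_attrs):
    """Decompose a co-purchase edge list into K attribute networks.

    copurchase_edges: iterable of (i, j) item pairs co-purchased by some user.
    item_attrs: dict mapping item -> list of K attribute values (None = missing).
    Returns a list of K edge sets; edge (i, j) enters network k iff
    items i and j share the same (non-missing) value of attribute k.
    """
    networks = [set() for _ in range(num_attrs)]
    for i, j in copurchase_edges:
        for k in range(num_attrs):
            vi, vj = item_attrs[i][k], item_attrs[j][k]
            if vi is not None and vi == vj:
                networks[k].add((min(i, j), max(i, j)))  # undirected edge
    return networks
```

Each returned edge set induces a subgraph of the co-purchase network, i.e., one attribute network per attribute field.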

3. Methodology

Figure 2. The architecture of eRAN.

The overview of the modeling framework is shown in Fig. 2. First, each item in the attribute networks is represented by K attribute embedding vectors via deep autoencoders. Then, users are mapped to an embedding layer and their personalized preferences toward attributes are captured by an attention mechanism, so as to compute the personalized item similarity. Finally, a personalized ranking loss is constructed over the individual item pairs, with negative sampling applied to non-paired items.

3.1. Embedding Attribute Networks

Attribute networks provide a new perspective for probing users’ preferences by mapping items to a manifold space with different types of connections, which overcomes the limitations of handling raw item attributes in Euclidean space. Therefore, it is naturally appealing to first map items to low-dimensional vectors that encode the items in the attribute space. For each node i in attribute network G_k, we can learn a mapping function f_k to obtain a d-dimensional vector for the node, i.e., z_i^k = f_k(i), where d ≪ |V|.

As discussed previously, higher-order proximity in the attribute network can disclose item similarity more accurately, and meanwhile users’ preferences toward a specific attribute propagate along network paths. For example, in Fig. 1, node A and node C share a common neighbor B, and they would be regarded as similar when second-order similarity is taken into account. Specifically, second-order proximity can be defined as the similarity of the neighborhood structure between a pair of nodes; thus we can represent a node i by its adjacency vector a_i, the i-th row of the adjacency matrix A_k constructed from the k-attribute network, whose entry a_ij = 1 when there exists a link between nodes i and j.

Considering the high nonlinearity of the network structure, we propose to represent the adjacency vector of each node in the attribute network via a deep autoencoder. A deep autoencoder is a typical deep learning model for handling nonlinearity, which generally consists of two parts, an encoder and a decoder, both containing multi-layer nonlinear functions. The feedforward process of the encoder maps the input data to the representation space as follows:

h_i^(l) = σ(W^(l) h_i^(l−1) + b^(l)),  l = 1, …, L,   (1)

where L represents the number of encoder layers, W^(l) and b^(l) denote the parameters of the l-th layer, and h_i^(0) = a_i is the input. In particular, z_i = h_i^(L) can be regarded as the hidden representation of a_i when l = L. Similar to the encoder, the decoder applies several nonlinear functions to map the representation vector back to the reconstruction space and obtain the reconstructed output â_i. By minimizing the reconstruction error between the input and the output, we can derive the representation, with the loss function formulated as follows:

L_rec = Σ_i ||â_i − a_i||².   (2)

We treat the adjacency vector a_i^k of node i in the k-attribute network as input and feed it into the autoencoder to derive the hidden representation z_i^k of the item in the k-attribute network.

However, the network may be sparse and the adjacency matrix would be filled with many zeros. A traditional autoencoder would thus be more likely to reconstruct zeros and hence fail to capture the local connectivity of the network structure. In order to tackle this issue, we impose a larger penalty on the reconstruction error of the non-zero elements (Wang et al., 2016) by incorporating a weight vector b_i: when a_ij = 0, b_ij is set to 1, and otherwise to a larger value β > 1. The loss can then be revised as follows:

L_rec = Σ_i ||(â_i − a_i) ⊙ b_i||²,   (3)

where ⊙ denotes the Hadamard product.
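As a minimal sketch (assuming sigmoid activations, which the paper does not fix), the encoder-decoder pass and the weighted reconstruction loss described above can be written as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencoder_forward(a, enc_layers, dec_layers):
    """Forward pass of a deep autoencoder on an adjacency vector `a`.
    enc_layers / dec_layers are lists of (W, b) tuples. Returns the hidden
    representation z (last encoder layer) and the reconstruction a_hat."""
    h = a
    for W, b in enc_layers:
        h = sigmoid(h @ W + b)
    z = h
    for W, b in dec_layers:
        h = sigmoid(h @ W + b)
    return z, h

def weighted_recon_loss(a, a_hat, beta):
    """Reconstruction error with a larger penalty beta > 1 on the non-zero
    entries of `a`: ||(a_hat - a) * b||^2 with b_j = beta if a_j != 0 else 1."""
    b = np.where(a != 0, beta, 1.0)
    return float(np.sum(((a_hat - a) * b) ** 2))
```

In practice the parameters would be trained by backpropagation; the sketch only shows how the penalty vector reweights the squared error.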

It is worth noting that this attribute-network representation framework is capable of handling cold-start items whose attributes are given. Though a newly released item n has no prior co-purchase records, its attributes provide clues to connect it with the existing attribute networks. Specifically, we can regard the item as a new node with attribute vector x_n. Then, for the k-attribute network, edges can be connected to those existing nodes that have the same value of attribute k as n. With the learned parameters of the autoencoder, we can further obtain the hidden representation of the new item.

3.2. Personalized Item Similarity Based on Attribute Networks

Item-to-item CF is a typical method that employs neighborhood items to compute the recommendation score of an item. In this approach, the item similarity is computed as the inner product of the latent factors of the item pair, generally in the following form according to (Kabbur et al., 2013):

sim(i, j) = p_i · q_j,   (4)

where p_i and q_j denote the latent factors of items i and j respectively. The similarity between items i and j obtained through the inner product is uniform across distinct users, which remains a major limitation of these methods. Therefore, it is desirable to take an individual view of item similarity.

With the derived representations from the attribute networks, we can replace the item latent factors with the node embeddings to devise a neural model for item-based recommendation. Moreover, we can decompose the item-to-item similarity by taking users’ preferences toward each field of attributes into account. Each user u is first mapped to an embedding vector p_u, and then the user’s preference toward a particular attribute of an item is captured through an attention mechanism. Attention mechanisms have been widely introduced in NLP, computer vision, and recommender systems to track the attractions of different components. Specifically, to score the attention weight of user u for item i on a particular attribute k, we simply take the inner product between the user embedding vector and the node embedding from the deep autoencoder of the k-attribute network:

s_{u,i}^k = p_u · z_i^k.   (5)

We can further apply softmax to normalize the user’s attention scores for an item over the attributes:

α_{u,i}^k = exp(s_{u,i}^k) / Σ_{k′} exp(s_{u,i}^{k′}).   (6)
The attention weights can be explained as the extent to which the user desires a particular attribute of an item, and thus can be exploited to provide explanations for attribute-aware recommendation.
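Under the assumption that the embeddings are plain vectors, the attention computation (inner products followed by a softmax) can be sketched as:

```python
import numpy as np

def attention_weights(p_u, item_attr_embs):
    """Attention of user u over the K attribute-network embeddings of one
    item: inner products p_u . z_i^k, normalized with a softmax."""
    scores = np.array([p_u @ z for z in item_attr_embs])
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()
```

The resulting weights sum to one, so they can be read directly as the share of the user's interest attributed to each attribute field.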

Then, we can derive each individual user’s view of an item from the node embeddings of all the attribute networks, taking a weighted average of the node embeddings from each type of attribute network:

q_i^u = Σ_k α_{u,i}^k z_i^k.   (7)

Motivated by the general idea of item-to-item methods, we can employ these item representations to compute the personalized similarity in the neighborhood, which can be approximated by

sim_u(i, j) = −||q_i^u − q_j^u||².   (8)
Different from prior item-based methods, we replace the inner product with an L2-norm distance metric to measure item relationships with the embedding vectors. As shown in (Hsieh et al., 2017), the inner product violates the triangle inequality, which may lead to suboptimal solutions. Moreover, the personalized item similarity decomposes the relationships with respect to attributes, which can provide fine-grained explanations for recommendations.
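A minimal sketch of these two steps, assuming the attention weights and attribute embeddings are given as arrays:

```python
import numpy as np

def personalized_item_vec(alpha, item_attr_embs):
    """Attention-weighted average of an item's K attribute-network
    embeddings: the user-specific item representation."""
    return sum(a * z for a, z in zip(alpha, item_attr_embs))

def personalized_similarity(q_i, q_j):
    """Negative squared L2 distance between two personalized item vectors,
    replacing the inner product of classic item-based CF (the distance
    satisfies the triangle inequality; the inner product does not)."""
    return -float(np.sum((q_i - q_j) ** 2))
```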

3.3. Loss Function and Optimization

Since we obtain the personalized similarity for each pair of items, we can use it to guide the learning of both the user embeddings and the item representations, as well as the attention weights on different attributes. An underlying assumption is that users remain stable in their preferences for items, and therefore the neighborhood items tend to be similar in the view of the user. Specifically, given a particular user u and one of the purchased items i, it should be similar to the neighborhood items j ∈ R_u, j ≠ i. Thus, the representations can be learned by maximizing the aggregate personalized similarity, written as a loss function by taking the negative value based on Equation (8):

L_1 = −Σ_{u∈U} Σ_{i∈R_u} Σ_{j∈R_u, j≠i} sim_u(i, j),   (9)

where U represents the set of all users. However, this loss function is likely to get trapped in a trivial solution in which all items are approximated by the same representation. Thus, similar to the optimization technique proposed in BPR (Rendle et al., 2009), which assumes that users prefer the items they have bought over those they have not, we introduce a negative sampling strategy to avoid this issue.

Specifically, given a user u, we can sample an item j′ ∉ R_u as a negative sample. Then, for each item i ∈ R_u and a co-purchased item j ∈ R_u, the similarity should naturally be higher than that between the non-paired items i and j′, satisfying the following inequality:

sim_u(i, j) > sim_u(i, j′).   (10)

The loss function with negative sampling can then be revised as

L_1 = −Σ_{u∈U} Σ_{i,j∈R_u, j≠i} Σ_{j′∉R_u} ln σ(sim_u(i, j) − sim_u(i, j′)).   (11)
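The exact surrogate for the pairwise inequality is a modeling choice; a common BPR-style option, shown here as an assumption rather than the paper's definitive form, is the negative log-sigmoid of the similarity margin:

```python
import numpy as np

def pairwise_ranking_loss(sim_pos, sim_neg):
    """BPR-style surrogate for sim_u(i, j) > sim_u(i, j'):
    mean of -ln sigmoid(sim_pos - sim_neg) over sampled triples."""
    x = np.asarray(sim_pos, dtype=float) - np.asarray(sim_neg, dtype=float)
    return float(np.mean(np.log1p(np.exp(-x))))  # stable form of -ln sigma(x)
```

Minimizing this drives the similarity of co-purchased pairs above that of negatively sampled pairs.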
To preserve the attribute network structures and learn personalized item representations tailored for recommendation, we combine the loss functions in Equation (3) and Equation (11) with a weighting parameter γ and jointly minimize the following objective function:

L = L_1 + γ Σ_{k=1}^{K} L_rec^(k).   (12)

We adopt Adaptive Moment Estimation (Adam) to optimize the objective function in Equation (12). In each iteration, we sample a mini-batch of users and item pairs, with the corresponding adjacency matrices, to update the parameters.

3.4. Recommendation Score

Similar to the traditional item-to-item CF method, when evaluating the recommendation score of user u for item i given the learned representations, we need to revisit the relationships between item i and each item ever purchased by u, which can be approximated by

s_{u,i} = (1 / |R_u \ {i}|) Σ_{j∈R_u \ {i}} sim_u(i, j),   (13)

where R_u \ {i} represents the set of items rated by the user except for i, i.e., the neighborhood of item i.

In particular, for a new item n, since it has no prior co-purchase records, we can only connect it to the existing nodes in the attribute networks. With the learned parameters and representations, we can still construct the individual representation q_n^u for user u. However, item n never appears together with any of the neighborhood items, so we relax the restriction and derive the recommendation score from the minimum similarity with the neighborhood items:

s_{u,n} = min_{j∈R_u} sim_u(n, j).   (14)

When generating recommendations for u, we simply rank the candidate items by recommendation score and select the ones with the highest scores. In this recommendation framework, we can easily interpret the recommendations with both the personalized item similarity and the user’s attention weights on the attributes of each item. Specifically, when item i is recommended to user u, we can obtain the attention weights according to Equations (5) and (6) to identify which attributes of i attract the user; meanwhile, we can also position the item j in the neighborhood that is most similar to i. Therefore, we can recommend i to u with the following explanation: “We recommend i because it is similar to j on attribute k.”
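The scoring rule, including the relaxed minimum-similarity variant for cold-start items, can be sketched as follows (hypothetical helper; the personalized item vectors are assumed precomputed):

```python
import numpy as np

def recommendation_score(q_target, q_history, new_item=False):
    """Score one candidate item for a user from personalized item vectors.
    q_history holds the representations of the user's rated items.
    Regular items: mean similarity to the neighborhood; cold-start items:
    the minimum similarity, since the item has no co-purchase records."""
    sims = [-float(np.sum((q_target - q) ** 2)) for q in q_history]
    return min(sims) if new_item else float(np.mean(sims))
```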

4. Experimental Setup

We validate the effectiveness of the proposed methods on three real-world datasets. Eight state-of-the-art (SOTA) baseline methods are included for a thorough comparative study.

4.1. Data Sets

We first briefly introduce the datasets used in our experiments, with the statistics listed in Table 1.

Kaggle-Movie: This dataset is extracted from the Kaggle Challenge Dataset. We use the directors, genres, and top five actors as attributes.

Goodreads-Potery: This dataset is collected by Wan et al. (Wan and McAuley, 2018) from Goodreads, a popular online book review website. Several attributes are used, including authors, number of pages, publication year, and the top three user-generated shelf names.

Amazon-Music: Each top-level product category on Amazon is constructed as a separate dataset by McAuley et al. (McAuley et al., 2015). We choose the dataset constructed from the music category, and extract the top three genres as well as the price as attributes.

In this paper, we remove items with missing values, treat ratings larger than 3 as positive feedback, and retain users whose history length is larger than 5, 5, and 3 for Kaggle-Movie, Goodreads-Potery, and Amazon-Music respectively.

# users # items # actions # features
Kaggle-Movie 663 6850 61088 3
Goodreads-Potery 39540 24052 449401 4
Amazon-Music 11697 7100 65950 2
Table 1. Data statistics.

4.2. Baseline Methods

The following SOTA methods are applied as baselines in our experiments.

NMF (Paatero and Tapper, 2010): NMF is a widely used collaborative filtering approach, which factorizes the binary interaction matrix.

BPR-MF (Rendle et al., 2009): BPR-MF is a well-known top-N recommendation method to cope with implicit matrix, which uses the Bayesian personalized ranking optimization criterion.

FM (Rendle, 2010): FM is a successful feature-based recommendation method, which is effective on sparse data.

DeepFM (Guo et al., 2017): DeepFM is a deep variant of FM which imposes a factorization machine as the “wide” module to extract shallow feature interactions.

PNN (Qu et al., 2016): PNN is another deep variant of FM which introduces a product layer after the embedding layer to capture high-order feature interactions.

AFM (Xiao et al., 2017): AFM extends FM by using attention mechanism to distinguish the different importance of second-order combinatorial features.

SVDFeature (Chen et al., 2012): SVDFeature is an effective toolkit for feature-based matrix factorization.

FISM (Kabbur et al., 2013): FISM is a state-of-the-art item-based CF method which learns global item similarities from user-item interactions.

eRAN-L1: eRAN-L1 is a submodel which only optimizes the ranking loss.

eRAN-L2: eRAN-L2 is another submodel which only optimizes the reconstruction loss. In this submodel, we fix the user embeddings to 1.0 during training.

4.3. Parameter Settings

For our method, we set the mini-batch size, the learning rate of Adam, and the hyper-parameters β and γ to 2000, 0.001, 1500, and 0.2 respectively. We keep the same autoencoder structure across datasets. Specifically, the dimensions of the hidden states are 1024, 256, and 32 for the successive layers according to Equation (1). As for the baseline methods, we apply default parameters except for the embedding size, which is fixed to 32 for all methods.

5. Experimental Results

Method Kaggle-Movie Goodreads-Potery Amazon-Music
P@5 P@10 P@15 P@5 P@10 P@15 P@5 P@10 P@15
NMF 0.1201 0.0746 0.0548 0.1386 0.0779 0.0525 0.0902 0.0602 0.0454
BPR-MF 0.1210 0.0742 0.0547 0.1412 0.0801 0.0565 0.0806 0.0516 0.0392
FISM 0.1217 0.0736 0.0536 0.1528 0.0827 0.0576 0.0951 0.0636 0.0465
FM 0.1168 0.0725 0.0531 0.1524 0.0844 0.0578 0.0844 0.0530 0.0386
DeepFM 0.1183 0.0726 0.0542 0.1540 0.0849 0.0590 0.0874 0.0559 0.0419
PNN 0.1195 0.0719 0.0537 0.1557 0.0842 0.0590 0.0875 0.0571 0.0428
AFM 0.1154 0.0721 0.0528 0.1376 0.0783 0.0550 0.0739 0.0497 0.0387
SVDFeature 0.1219 0.0751 0.0556 0.1547 0.0848 0.0588 0.0943 0.0637 0.0480
eRAN-L1 0.1161 0.0733 0.0532 0.1485 0.0836 0.0572 0.0707 0.0470 0.0369
eRAN-L2 0.0237 0.0190 0.0161 0.0305 0.0218 0.0163 0.0541 0.0385 0.0305
eRAN 0.1289 0.0789 0.0570 0.1626 0.0875 0.0604 0.1104 0.0691 0.0508
Table 2. Precision@K of the three datasets.
Method Kaggle-Movie Goodreads-Potery Amazon-Music
n@5 n@10 n@15 n@5 n@10 n@15 n@5 n@10 n@15
NMF 0.4451 0.4986 0.5213 0.5837 0.5988 0.6149 0.3280 0.3697 0.3941
BPR-MF 0.4554 0.5002 0.5206 0.6044 0.6256 0.6379 0.2807 0.3173 0.3362
FISM 0.4593 0.5017 0.5195 0.6579 0.6783 0.6923 0.3672 0.4075 0.4253
FM 0.3639 0.4200 0.4419 0.6222 0.6508 0.6595 0.2971 0.3324 0.3454
DeepFM 0.3782 0.4267 0.4503 0.6370 0.6627 0.6731 0.2969 0.3363 0.3547
PNN 0.3822 0.4341 0.4586 0.6527 0.6747 0.6827 0.3015 0.3431 0.3618
AFM 0.3727 0.4289 0.4527 0.5552 0.5857 0.5968 0.2437 0.2867 0.3079
SVDFeature 0.4272 0.4755 0.4964 0.6464 0.6716 0.6873 0.3370 0.3888 0.4122
eRAN-L1 0.3709 0.4312 0.4491 0.6174 0.6581 0.6693 0.2164 0.2622 0.2844
eRAN-L2 0.0684 0.0916 0.1064 0.0852 0.1019 0.1467 0.1805 0.2173 0.2367
eRAN 0.4702 0.5167 0.5340 0.6858 0.7073 0.7154 0.4026 0.4482 0.4671
Table 3. nDCG@K of the three datasets.

5.1. Recommendation Accuracy

We first conduct a comparative study to validate the superiority of our model over the introduced baseline methods in terms of recommendation accuracy. In this task, we adopt the leave-one-out evaluation strategy, that is, for each user we hold out one purchased item as the test set and use the remainder for training. Since it is too time-consuming to rank all items for every user during evaluation, we follow the experimental settings in (He et al., 2017), randomly sampling 100 negative items and ranking by recommendation score among those 100 items. Given the top-K ranked items, we apply Precision@K and nDCG@K as evaluation measures. The comparative results on the three datasets are shown in Table 2 and Table 3.
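For reference, the two measures can be computed as follows for binary relevance (a standard formulation; papers occasionally differ in the log base and the truncation of the ideal DCG):

```python
import numpy as np

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return len(set(ranked[:k]) & set(relevant)) / k

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k with a log2 position discount."""
    dcg = sum(1.0 / np.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0
```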

The proposed model is consistently better than all baselines on the three datasets, while the second best is relatively unstable across settings, showing that our method is more robust. In addition, we find that FISM outperformed the other baselines in many cases in nDCG, while SVDFeature and PNN performed better in Precision. These results indicate that eRAN can not only accurately recognize the items that users really prefer, but also tends to rank them at the top positions.

Moreover, we find that most attribute-based methods perform particularly well on Goodreads-Potery. A possible reason is that the attributes of user-generated shelf names and authors have great influence on user preferences, and our method can effectively infer users’ preferences toward these attributes. Also, it is notable that eRAN achieves the greatest improvement on Amazon-Music, which has the sparsest ratings among the three datasets. This might be because the attribute network simultaneously models first-order and higher-order relationships in the attribute space, which helps handle data sparsity.

It is also notable that AFM does not perform well on the three datasets, even worse than FM, which may be due to its inability to learn effective attention weights in the feature interaction space when features are scarce. On the contrary, eRAN can leverage the attention mechanism to model users’ fine-grained preferences in the attribute space.

Figure 3. Prediction results for cold-start items.

Considering the two variants of eRAN, the submodel eRAN-L2 can be seen as a kind of network embedding method that lacks optimization tailored for recommendation, and it achieves the worst performance. Meanwhile, eRAN significantly outperforms eRAN-L1, which validates the effectiveness of leveraging attribute information in improving performance.

5.2. Cold Start Item Recommendation

In this task, we evaluate the effectiveness of our model in handling cold-start items with given attributes. To simulate the cold-start scenario, we randomly hold out 40 items and regard them as new items with no purchasing records. We treat each such item as a new node and connect it with the existing attribute networks according to Section 3.1. Afterwards, we obtain the adjacency vector of the new node in each attribute network and feed it into the deep autoencoder to derive the respective node embeddings.

Then, for each ‘new’ item, we rank all users according to the recommendation score in Equation (14) to predict which users are most likely to purchase it. We use Recall to evaluate the effectiveness of the prediction, i.e., how many of the users who actually bought the new item are accurately predicted. The methods NMF, BPR-MF, and FISM cannot be applied in this setting because new items never appear in the rating matrix. Thus, we only run this experiment on the attribute-based recommendation methods, removing the corresponding new items in the training phase and evaluating the predictions with the same settings as our method.
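The per-item Recall used here can be sketched as (hypothetical helper):

```python
def recall_at_k(ranked_users, true_buyers, k):
    """Recall@k for one cold-start item: the share of its actual buyers
    found among the top-k users ranked by recommendation score."""
    return len(set(ranked_users[:k]) & set(true_buyers)) / len(true_buyers)
```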

The results on Kaggle-Movie are illustrated in Figure 3. As we can see, our method consistently outperforms the other attribute-based recommendation methods. Among the baselines, FM, DeepFM, and AFM achieve similar performance in this experiment; PNN performs second best when K is small, while SVDFeature shows competitive results when K is large. These results show the superiority of the proposed method in coping with cold-start items, and also illustrate that eRAN captures fine-grained user preferences toward attributes well.

5.3. Explanation and Visualization

One of the distinctive advantages of our model is that we can obtain insights into the underlying reasons for a recommendation. Thus, we explore the learned user embeddings and attention weights on attributes from both quantitative and qualitative perspectives to explain the recommendations. Taking the Kaggle-Movie dataset as an example, for each user we regard the user’s average attention scores over all interacted items as a general description of their preferences. Correspondingly, each user in the movie dataset can be described with attention scores on actor, director, and genre. A larger attention score on an attribute means the user may prefer the corresponding aspect more.

Movie | User | Actor Attention Score | Director Attention Score | Most Similar Movies | Explanation
Fear and Loathing in Las Vegas | 255 | 0.6006 | 0.3021 | Edward Scissorhands, A Nightmare on Elm Street | The same actor Johnny Depp
Fear and Loathing in Las Vegas | 639 | 0.3404 | 0.5159 | The Meaning of Life, Monty Python and the Holy Grail | The same director Terry Gilliam
Pulp Fiction | 467 | 0.4941 | 0.2874 | Django Unchained, Jurassic Park | The same actor Samuel L. Jackson
Pulp Fiction | 129 | 0.3091 | 0.4708 | Kill Bill, Reservoir Dogs | The same director Quentin Tarantino
Table 4. Two comparative case studies for explainable recommendation.

We first validate whether the attention mechanism actually plays a role in identifying users’ preferences. Specifically, based on the learned parameters, we select the 40 users with the largest actor-attention scores, the 40 users with the smallest, and another 40 random users as three separate test groups, denoted Max, Min, and Random, respectively. We then remove the actor network and train a new model with all other settings unchanged, and test the recommendation performance for the three groups, with results shown in Fig. 4(a) and Fig. 4(b). The lack of the actor network affects the three groups differently: the overall performance order is Min > Random > Max, showing that the Max group is most severely affected, which confirms our analysis that these users care more about actors. The learned user embeddings projected by t-SNE (van der Maaten and Hinton, 2008) are illustrated in Fig. 4(c), where Max and Min are clearly separated, demonstrating the effectiveness of the user embeddings in distinguishing users with different preferences.
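The group construction above can be sketched as follows. The scores here are random toy values, not the attention scores learned by eRAN; only the group names and sizes follow the text:

```python
import random

random.seed(1)
actor_attention = {u: random.random() for u in range(200)}  # toy per-user scores
k = 40

by_score = sorted(actor_attention, key=actor_attention.get)
groups = {
    "Min": set(by_score[:k]),    # 40 users with the smallest actor attention
    "Max": set(by_score[-k:]),   # 40 users with the largest actor attention
}
# Random group drawn from the remaining users so the three groups are disjoint.
rest = [u for u in actor_attention if u not in groups["Min"] | groups["Max"]]
groups["Random"] = set(random.sample(rest, k))

print({name: len(members) for name, members in groups.items()})
```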

In addition, we pick two cases in Table 4 to explain the attention scores output by eRAN. Users 255 and 639 both watched the movie Fear and Loathing in Las Vegas; however, the attention scores reveal that user 255 was driven by the actor, while user 639 was driven by the director. Computing the movies most similar to it for each user, we find that user 255 is keen on the actor Johnny Depp, while user 639 likes the director Terry Gilliam. Therefore, eRAN can readily provide explanations like “A is similar to B and C, especially with the same actors.”

Figure 4. Recommendation results and user embeddings for different user groups: (a) Precision; (b) nDCG; (c) User Embedding.
Figure 5. Impact of hyper-parameters on ranking performance: (a) Embedding Size.

5.4. Parameter Sensitivity

In this subsection, we examine the sensitivity of two parameters, i.e., the embedding size and the weighting parameter in the loss function.

Embedding size. Figure 5(a) demonstrates the impact of embedding size on the results. An embedding size of 32 performs best on both Kaggle-Movie and Amazon-Music in terms of Precision and nDCG. Moreover, performance remains stable across all settings, which shows the robustness of our method.

The weighting parameter. From the results shown in Figure 5(b), performance first increases with the weighting parameter and then begins to drop once it exceeds 1000. It is worth noting that our model reduces to eRAN-L2 when the parameter approaches zero, and to eRAN-L1 when it approaches infinity; the performance of these two variants is consistent with the trend indicated by the sensitivity analysis.
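A hedged sketch of how such a weighting parameter (call it `lam`) could blend the two loss terms so the limits match the text: as `lam -> 0` the combination reduces to the eRAN-L2 term, and as `lam -> infinity` it reduces to the eRAN-L1 term. The normalized form below is an assumption for illustration; the exact formulation in the paper may differ.

```python
def combined_loss(l1_term, l2_term, lam):
    """Convex-style blend: lam -> 0 keeps only L2; lam -> inf keeps only L1."""
    return (lam * l1_term + l2_term) / (1.0 + lam)

# lam = 0 recovers the pure L2 term; a very large lam recovers the L1 term.
print(combined_loss(2.0, 5.0, 0.0))    # 5.0 (pure L2)
print(combined_loss(2.0, 5.0, 1e9))    # ~2.0 (pure L1)
```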

6. Related Work

Our work is related to three streams of recommendation research: item-based, attribute-based, and explainable recommendation.

The idea of item-based CF is that a user’s predicted preference for a target item depends on the similarity of this item to all items the user has interacted with in the past. Traditional item-based CF methods often predefine a similarity measure such as cosine similarity or the Pearson coefficient (Sarwar et al., 2001). Another common approach is to employ random walks on the user-item bipartite graph (Liu et al., 2017). However, such heuristic similarity measures lack optimization tailored to different datasets and may thus yield suboptimal results. Ning and Karypis proposed SLIM, which learns item similarity directly from data by reconstructing the original user-item interaction matrix with an item-based CF model (Ning and Karypis, 2011). Kabbur et al. further proposed FISM, which exploits the low-rank property of the learned similarity matrix to handle data sparsity (Kabbur et al., 2013). While FISM has been shown to outperform earlier approaches, it estimates only a single global metric for all users. To address this, GLSLIM clusters the users and estimates an independent SLIM model for every user subset (Christakopoulou and Karypis, 2016), but the number of clusters is difficult to determine, so the modeling of personalized preferences remains coarse-grained.
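The traditional item-based CF scheme described above can be sketched with a toy implicit-feedback matrix: cosine similarity between item column vectors, then a score for a target item as the summed similarity to the user's past items. The data and helper names are illustrative only:

```python
import numpy as np

# Rows: users, columns: items (implicit feedback, 1 = interaction).
R = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

def cosine_sim(a, b):
    """Predefined cosine similarity between two item column vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def score(user, target, R):
    """Sum similarity of the target item to every item the user interacted with."""
    return sum(cosine_sim(R[:, target], R[:, j])
               for j in np.flatnonzero(R[user]) if j != target)

print(round(score(0, 2, R), 3))  # both past items have cosine 0.5 to item 2
```

Methods like SLIM and FISM replace the fixed `cosine_sim` with a similarity matrix learned from the data itself.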

In addition to user-item interactions, many researchers leverage additional information for recommendation, such as user/item attributes and context (Zhao et al., 2016; Baltrunas et al., 2011). FM is an early general feature-based framework for recommendation that is well suited to sparse structured data (Rendle, 2010) and is recognized as one of the most effective linear embedding methods. Following the recent success of deep learning, several deep variants of FM have been proposed to enhance representation capacity, including AFM (Xiao et al., 2017), DeepFM (Guo et al., 2017), and PNN (Qu et al., 2016). In all these methods, however, the feature weights are shared across users, so they neither capture fine-grained user preferences nor offer explainability.
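The FM scoring function referenced above (Rendle, 2010) can be sketched as y(x) = w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ, vⱼ⟩ xᵢxⱼ, computed efficiently via the standard O(kn) identity. The parameters below are made-up toy values; eRAN differs precisely in making such weights user-specific:

```python
import numpy as np

def fm_score(x, w0, w, V):
    """FM prediction: global bias + linear term + factorized pairwise interactions."""
    linear = w0 + w @ x
    # Pairwise term via the identity 0.5 * sum_f ((Vx)_f^2 - (V^2 x^2)_f),
    # which avoids enumerating all feature pairs explicitly.
    inter = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + inter

x = np.array([1.0, 1.0, 0.0])          # two active features
w0, w = 0.1, np.array([0.2, 0.3, 0.4])
V = np.array([[0.1, 0.2],              # one factor row per feature
              [0.3, 0.1],
              [0.0, 0.5]])
print(round(fm_score(x, w0, w, V), 4))  # 0.6 linear + 0.05 interaction
```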

Recently, employing auxiliary information to understand user behaviors and provide explainable recommendations has become prevalent. Zhang et al. propose EFM (Zhang et al., 2014), whose basic idea is to align each latent dimension in matrix factorization with a particular explicit feature and to recommend items that perform well on the features users care about. Chen et al. later extended EFM to tensor factorization (Chen et al., 2016). McAuley and Leskovec propose HFT to understand the hidden factors in latent factor models via hidden topics extracted from textual reviews (McAuley and Leskovec, 2013), and many probabilistic graphical model based methods have since been proposed for explainable recommendation (Wu and Ester, 2015; Ren et al., 2017). More recently, deep learning and attention mechanisms have attracted much interest and have been widely applied to explainable recommendation. For example, Seo et al. apply attention over user/item reviews to assess their usefulness, with the learned attention weights indicating which parts are more important (Seo et al., 2017). Chen et al. propose VER, which highlights the image regions a user may be interested in as explanations (Chen et al., 2018). Our work follows this thread but focuses on learning explanations from user behavior data rather than text.

7. Conclusions

In this paper, we propose eRAN, a personalized item-to-item recommendation method. By formulating co-purchase relationships and item attributes as multiple attribute networks, eRAN combines both views of recommendation. By plugging an attention mechanism into the construction of personalized item representations, eRAN gains the ability to derive the attraction of attributes to users and personalized item similarity simultaneously. Experiments on real-world datasets demonstrate the superiority of our method for recommendation tasks and cold-start items. Moreover, the learned user embeddings and attention weights capture fine-grained user preferences at the attribute level and guide the explanations for recommendations. Future work includes integrating multi-item relationships, such as complementarity and substitution, into our model, and exploring the influence of other attention mechanisms.


  • G. Adomavicius and A. Tuzhilin (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering (6), pp. 734–749. Cited by: §1.
  • L. Baltrunas, B. Ludwig, and F. Ricci (2011) Matrix factorization techniques for context aware recommendation. In Proceedings of the fifth ACM conference on Recommender systems, pp. 301–304. Cited by: §6.
  • T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu (2012) SVDFeature: a toolkit for feature-based collaborative filtering. Journal of Machine Learning Research 13 (Dec), pp. 3619–3622. Cited by: §4.2.
  • X. Chen, Z. Qin, Y. Zhang, and T. Xu (2016) Learning to rank features for recommendation over multiple categories. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 305–314. Cited by: §6.
  • X. Chen, Y. Zhang, H. Xu, Y. Cao, Z. Qin, and H. Zha (2018) Visually explainable recommendation. arXiv preprint arXiv:1801.10288. Cited by: §6.
  • E. Christakopoulou and G. Karypis (2016) Local item-item models for top-n recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 67–74. Cited by: §6.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247. Cited by: §4.2, §6.
  • X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182. Cited by: §5.1.
  • C. K. Hsieh, L. Yang, Y. Cui, T. Y. Lin, S. Belongie, and D. Estrin (2017) Collaborative metric learning. Cited by: §3.2.
  • S. Kabbur, X. Ning, and G. Karypis (2013) Fism: factored item similarity models for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 659–667. Cited by: §1, §1, §2, §3.2, §4.2, §6.
  • D. C. Liu, S. Rogers, R. Shiau, D. Kislyuk, K. C. Ma, Z. Zhong, J. Liu, and Y. Jing (2017) Related pins at pinterest: the evolution of a real-world recommender system. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 583–592. Cited by: §6.
  • J. McAuley and J. Leskovec (2013) Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pp. 165–172. Cited by: §6.
  • J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. Cited by: §4.1.
  • X. Ning and G. Karypis (2011) Slim: sparse linear methods for top-n recommender systems. In 2011 11th IEEE International Conference on Data Mining, pp. 497–506. Cited by: §6.
  • P. Paatero and U. Tapper (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5 (2), pp. 111–126. Cited by: §4.2.
  • Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, and J. Wang (2016) Product-based neural networks for user response prediction. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pp. 1149–1154. Cited by: §4.2, §6.
  • Z. Ren, S. Liang, P. Li, S. Wang, and M. de Rijke (2017) Social collaborative viewpoint regression with explainable recommendations. In Proceedings of the tenth ACM international conference on web search and data mining, pp. 485–494. Cited by: §6.
  • S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pp. 452–461. Cited by: §3.3, §4.2.
  • S. Rendle (2010) Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995–1000. Cited by: §4.2, §6.
  • B. Sarwar, G. Karypis, J. Konstan, and J. Riedl (2001) Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pp. 285–295. Cited by: §1, §6.
  • S. Seo, J. Huang, H. Yang, and Y. Liu (2017) Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 297–305. Cited by: §6.
  • L. van der Maaten and G. E. Hinton (2008) Visualizing high-dimensional data using t-SNE. JMLR 9, pp. 2579–2605. Cited by: §5.3.
  • M. Wan and J. McAuley (2018) Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 86–94. Cited by: §4.1.
  • D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. Cited by: §3.1.
  • Y. Wu and M. Ester (2015) Flame: a probabilistic model combining aspect based opinion mining and collaborative filtering. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 199–208. Cited by: §6.
  • J. Xiao, H. Ye, X. He, H. Zhang, F. Wu, and T. Chua (2017) Attentional factorization machines: learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617. Cited by: §4.2, §6.
  • Y. Zhang and X. Chen (2018) Explainable recommendation: a survey and new perspectives. arXiv preprint arXiv:1804.11192. Cited by: §1.
  • Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma (2014) Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 83–92. Cited by: §6.
  • W. X. Zhao, S. Li, Y. He, E. Y. Chang, J. Wen, and X. Li (2016) Connecting social media to e-commerce: cold-start product recommendation using microblogging information. IEEE Transactions on Knowledge and Data Engineering 28 (5), pp. 1147–1159. Cited by: §6.
  • X. W. Zhao, Y. Guo, Y. He, H. Jiang, Y. Wu, and X. Li (2014) We know what you want to buy: a demographic-based system for product recommendation on microblogs. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1935–1944. Cited by: §1.