Recommender systems help users find products they may be interested in from an enormous list of products. Matrix Factorization (MF) methods (Koren et al., 2009) are widely adopted in recommender systems because of their accuracy and scalability. MF methods usually rely on the explicit (e.g., user ratings) or implicit (e.g., click behaviors) interactions between users and products for recommendation. However, a rating or binary interaction can only reflect the user’s overall attitude towards a product; it carries no information about the underlying reasons for the user’s behavior. As a result, it is difficult for MF methods to model users’ fine-grained preferences on specific product features and to provide explanations for recommendations.
To tackle these limitations, researchers have attempted to utilize reviews to alleviate the data sparsity problem and provide more explainable recommendations (Chen et al., 2015; Cheng et al., 2018a; He et al., 2015; Cao et al., 2017; Cheng et al., 2019). As accompanying information of ratings, the textual review expresses the user’s opinions on different product features, and thus contains more fine-grained information about the user’s preferences. Different strategies have been applied to incorporate reviews into MF models, including sentiment analysis (Pero and Horváth, 2013), representation learning (Zhang et al., 2017; Catherine and Cohen, 2017), and topic models (McAuley and Leskovec, 2013; Tan et al., 2016). Although these methods have achieved some progress, the generated vector representations of users and products are still latent and thus cannot explicitly model users’ preferences on specific product features, which could impede their performance.
Another direction is to leverage the aspects mentioned in user reviews for recommendation. In this paper, an aspect is defined as a word or phrase used by users in their product reviews to describe a product feature. For example, “battery life” and “battery duration” are two different aspects even though they refer to the same product feature. Some existing methods detect aspects in user reviews and leverage them to model users’ fine-grained preferences for specific product features (Zhang et al., 2014a; Dong and Smyth, 2017). For example, EFM (Zhang et al., 2014a) conducted aspect-level sentiment analysis to extract the user’s preference and the product’s quality on specific product features, then incorporated the results into an MF framework to provide more accurate recommendations. SULM (Bauman et al., 2017) and LRPPM (Chen et al., 2016) went beyond EFM (Zhang et al., 2014a) by using more effective methods to identify the impact of each aspect on the overall rating. However, these methods rely heavily on the accuracy of external sentiment analysis tools.
Besides the above-mentioned limitations, these methods also suffer from the following two problems. First, for each user-product pair, they only consider the aspects shared by the user’s reviews and the product’s reviews. However, due to the sparsity of user-product interactions and users’ diverse language usage, the number of common aspects mentioned in the reviews of both the targeted user and product is usually very limited. Second, the aspects a user cares about may differ across products (even within the same category). For example, a user may mostly care about “special effects” when watching a superhero movie, while paying more attention to the “plot” of a suspense movie.
Motivated by the above concerns, in this paper, we propose an Attentive Aspect-based Recommendation Model (AARM), which can effectively tackle the above two problems. To alleviate the first problem, aspect sparsity, AARM models the interactions between synonymous and similar aspects, where synonymous aspects are the ones referring to the same product feature (e.g., “storyline” and “plot”), and similar aspects are those of different features that are closely related (e.g., “battery life” and “charging speed”). Intuitively, a user’s attention to an unmentioned aspect can be inferred through its similar aspects. For instance, a user who cares about the “battery life” of cellphones may also care about their “charging speed”, although “charging speed” has never been mentioned in this user’s reviews. In our model, an aspect extracted from reviews is first represented as an embedding vector. Then a user $u$’s satisfaction with product $p$ according to aspect $a$ is estimated by calculating the interactions between $a$ and all the aspects mentioned in $p$’s reviews. An attention module is designed to pick out the interactions between meaningful aspect pairs. In this way, we achieve the goal of capturing the interactions between synonymous and similar aspects.
For the second problem of identifying the user’s varied interests in aspects, AARM introduces another attention module which takes user, product and aspect information into consideration. In this way, the user’s varied interests in aspects can be captured by the product-dependent user attention. Instead of rating prediction, we target the top-N recommendation task, which is the most common recommendation scenario in real-world systems (Cremonesi et al., 2010; Wang et al., 2016), and train the model with a pair-wise learning-to-rank method. To this end, our model estimates a user $u$’s satisfaction towards a product $p$ by (1) estimating $p$’s performance on the aspects that $u$ is concerned with; and (2) identifying the impacts of these aspects on the overall satisfaction.
We evaluate our model on five product datasets from Amazon on the top-N recommendation task. Experimental results show that AARM outperforms several state-of-the-art methods. Comparative experiments have also been conducted to demonstrate the importance of modeling interactions between different aspects and the effectiveness of our attention module on capturing user’s varied attentions towards aspects. Our main contributions are outlined as follows.
We propose a novel recommendation method to model the interactions between both the same and different aspects, which helps to alleviate the aspect sparsity problem in reviews. To the best of our knowledge, this is the first attempt to model the interactions between different aspects to capture user preferences in recommendation. The method for capturing the similarity relation between different aspects can also be used in other recommendation scenarios (e.g., recommendation with tags or item metadata).
We design an attention mechanism in AARM to capture the user’s varied attention to different aspects across products. The experiments demonstrate that the careful design of the inputs and structure of this attention module is very effective in improving the recommendation accuracy.
We conduct extensive experiments on real-world datasets to demonstrate the effectiveness of our model. Experimental results show that our method can achieve superior performance by a large margin.
The remainder of the paper is organized as follows. We first discuss existing works related to our method in Section 2. In Section 3, we describe the details of AARM and how to train the model. In Section 4, we describe the experimental settings and report the results to verify our assumptions and compare our method with some state-of-the-art baselines. Finally, in Section 5, we conclude the paper.
2. Related Work
In recent years, many researchers have paid more attention to users’ product reviews in order to improve recommendation accuracy and provide recommendation explanations. According to how these methods utilize user reviews, we broadly group them into three categories: review-level, topic-level and aspect-level methods. In this section, we first review these three types of review-based methods, and then briefly discuss recommendation methods with the attention mechanism, which is an important component of our model.
2.1. Review-level Methods
Review-level methods treat the review as a single piece of information and incorporate it with ratings. The opinion-driven matrix factorization model (Pero and Horváth, 2013) calculates the overall opinion of a review by summing up the orientations of opinion words in the text, and then combines it with numerical ratings for rating prediction. Meng et al. (Meng et al., 2018) incorporated other users’ emotions towards a review to calculate the importance of this review in the training of a matrix factorization model. Some methods concatenate all the reviews belonging to a user (or item) as a user (or item) document, and then employ deep learning methods to learn a continuous vector representation for the user (or item) (Catherine and Cohen, 2017; Zheng et al., 2017; Guo et al., 2018; Zhang et al., 2017). For example, Transnets (Catherine and Cohen, 2017) and DeepCoNN (Zheng et al., 2017) process the user and item documents with convolutional neural networks to generate the vector representations of users and items. JRL (Zhang et al., 2017) adopts the PV-DBOW model (Le and Mikolov, 2014), an unsupervised method for learning continuous vector representations of documents, to learn the user and item vector representations from their reviews. In Transnets, DeepCoNN and JRL, in order to estimate the matching degree between a user and an item, the reviews of the user or item are compressed into a single vector which is an overall representation of the reviews. In this way, these review-level methods neglect the user-item interactions at the level of review components (e.g., the user’s opinions about the product’s specific features), which can be used to connect the user with candidate products and provide more explainable recommendations.
2.2. Topic-level Methods
Topic-level methods build probabilistic graphical models to extract topics from reviews. HFT (McAuley and Leskovec, 2013) combines topic vectors from reviews with latent factors from ratings to improve rating prediction accuracy. Subsequently, some studies employed different topic models and combination strategies for the review-based rating prediction task. For example, different from HFT, ITLFM (Zhang and Wang, 2016) linearly combines the latent topics and the latent factors. CMR (Xu et al., 2014) is a probabilistic graphical model which simultaneously associates the review text, the hidden user communities and the item group relationship with numerical ratings. RBLT (Tan et al., 2016) also utilizes LDA to extract topics from review text. Then the preference distribution vector of each user and the recommendability distribution vector of each item are combined with a vanilla matrix factorization model for rating prediction. More recently, Cheng et al. (Cheng et al., 2018b) defined a high-level semantic concept ‘aspect’ as a probability distribution of topics. They proposed the ATM model to extract topics from reviews and associate the topics with ‘aspects’, and then proposed the ALFM model to associate latent factors with ‘aspects’. In this way, topics are correlated with factors indirectly via the ‘aspects’. To estimate the overall rating score, they first calculated the item’s score on each aspect and then summed the scores up using aspect importance as weights. Similarly, MMALFM (Cheng et al., 2019) follows the definition of ‘aspect’ in (Cheng et al., 2018b) and jointly models the ‘aspects’ in textual reviews and item images. These topic-level methods usually focus on the rating prediction task, while we target top-N recommendation. Similar to review-level methods, when estimating the matching degree between a user and a product, these topic-level methods also neglect the interactions between the components of the user’s and the product’s reviews. Moreover, it is difficult to associate a topic, which is a probability distribution over words or phrases, with specific product features. Because of these limitations, these methods are incapable of capturing the user’s preference towards product features in a finer-grained manner, and thus of providing more accurate and explainable recommendations.
2.3. Aspect-level Methods
Aspect-level methods extract aspects from reviews and incorporate them with ratings for recommendation. The method proposed in this paper falls into this category. Ganu et al. (Ganu et al., 2013) manually defined six aspects and four sentiments for restaurant reviews and used a regression-based method for rating prediction. Zhang et al. (Zhang et al., 2014a) employed an unsupervised tool for aspect extraction and aspect-level sentiment analysis. Aspect and sentiment outputs from this step were integrated with matrix factorization methods for rating prediction. Chen et al. (Chen et al., 2016) proposed a tensor-matrix factorization method to select the most interesting product features for each user with a learning-to-rank method. The rating scores were then predicted as the weighted summation of the product’s sentiment scores on the product features the user cares about most. Bauman et al. (Bauman et al., 2017) also extracted aspects and conducted aspect-level sentiment analysis with external tools. The results of aspect-level sentiment analysis were used in their model SULM as the ground-truth labels to train a latent factor model for every aspect. These aspect-level latent factor models were then used to predict the user’s sentiment scores toward each aspect of a product. The number of parameters in SULM is very large, as a user or product usually has many aspects. As we can see, the above methods often rely on external sentiment tools for aspect-level analysis.
There are also some studies that pay attention to users’ varied interests. Chen et al. (Chen et al., 2016) proposed an aspect ranking method to capture users’ varied interests, but they focused on a user’s interest variation over different categories. ANCF (Cheng et al., 2018a), ALFM (Cheng et al., 2018b) and ANR (Chin et al., 2018) also try to capture users’ varied interests towards aspects; ANCF and ANR, in particular, use neural attention layers to do so. However, there are some important differences between these methods and ours. First, the ‘aspect’ defined in ANCF, ALFM and ANR is different from the one defined in our model. In ANCF, an ‘aspect’ is defined as a combination of a topic vector and an embedding vector. In ALFM, an ‘aspect’ is defined as a probability distribution of topics, and thus ALFM is more like a topic-level model. In ANR, an ‘aspect’ of a user is a weighted sum of all the words’ embeddings in the user’s reviews. In contrast, ‘aspects’ in our model are words or phrases directly extracted from reviews, which is a much more fine-grained concept. Second, ANCF, ALFM and ANR do not consider interactions between different aspects, whereas our method models these interactions because, intuitively, aspects are not independent of each other. Third, those three existing methods were originally designed for rating prediction, while our model is designed for top-N recommendation.
He et al. (He et al., 2015) did not conduct sentiment analysis but used the aspect frequency information in reviews to construct a user-item-aspect tripartite graph for recommendation. The improvement over baselines reported in (He et al., 2015) verified that the aspect mention signals in reviews can already reflect the user’s interests in aspects. Similarly, in AARM we do not conduct sentiment analysis on reviews explicitly, which helps to simplify the model design and implementation. Moreover, AARM considers both the interactions between different aspects and the user’s varied preference towards aspects, which are neglected by previous studies.
2.4. Attention Mechanism
In recent years, many deep learning-based recommendation methods have been proposed and have achieved good performance in many tasks (He and Chua, 2017; He et al., 2017; Yang et al., 2017; Tan et al., 2018). The attention mechanism, which can assign adaptive weights to a set of features, has also been employed in recommendation models (Chen et al., 2017; He et al., 2018; Cao et al., 2018; Ebesu et al., 2018; Chen et al., 2019). For example, in the NARRE model (Chen et al., 2018) for review-based rating prediction, Chen et al. introduced an attention module to calculate the usefulness of reviews. In TEM (Wang et al., 2018), which utilizes the side information of users and items for explainable recommendation, a neural attention layer is used to assign weights to cross features and provide recommendation explanations. ACF (Chen et al., 2017), which focuses on multimedia recommendation, uses a component-level attention module to find informative components of multimedia items (images/videos), and an item-level attention module to select representative items to represent users’ preferences. AFM (Xiao et al., 2017), which is an extension of factorization machines (FM) (He and Chua, 2017; Rendle, 2010), uses an attention neural network to discriminate the importance of different feature interactions. ANCF (Cheng et al., 2018a) and ANR (Chin et al., 2018) also have attention modules, which have been discussed in the previous section. Compared with these methods, we specifically design two attention modules for the fine-grained modeling of product features extracted from user reviews. The user-level attention module in AARM is built to find the product features that the user is most concerned with for a candidate product, while the aspect-level attention module is constructed to select informative aspect interactions.
3. Attentive Aspect-based Recommendation Model
In this section, we first provide an overview of our method and define some important notations, and then introduce how to extract aspects from user reviews. After that, we describe the structure and details of the proposed AARM model. In particular, we elaborate how AARM could model the interactions between different aspects and handle user’s varied interests in aspects. Finally, we discuss the parameter inference in AARM.
Given a user set $\mathcal{U}$ and a product set $\mathcal{P}$, AARM estimates a satisfaction score $\hat{y}_{u,p}$ for a user $u \in \mathcal{U}$ towards a product $p \in \mathcal{P}$. The candidate products are then ranked in descending order of $\hat{y}_{u,p}$ and the top $N$ products are recommended to $u$. In our method, aspects extracted from user reviews are used as the explicit features of users and products. We define $\mathcal{A}$ as the aspect set of the dataset. The aspects that have been mentioned in the reviews of user $u$ are represented as $\mathcal{A}_u$, which is a subset of $\mathcal{A}$. Similarly, the aspects that have been mentioned in product $p$’s reviews are represented as $\mathcal{A}_p$. The rating given to product $p$ by user $u$ is denoted as $r_{u,p} \in \mathcal{R}$, where $\mathcal{R}$ is the collection of ratings.
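The ranking step described above can be illustrated with a minimal sketch. It assumes the satisfaction scores have already been computed by some model; the function name `recommend_top_n` and the optional exclusion of already purchased products are illustrative assumptions, not part of the paper.

```python
def recommend_top_n(scores, n, exclude=()):
    """Rank candidate products by predicted satisfaction score.

    scores:  dict mapping product id -> predicted score (assumed given).
    n:       number of products to recommend.
    exclude: products to skip, e.g. ones the user already bought (assumption).
    """
    candidates = [(s, p) for p, s in scores.items() if p not in exclude]
    # Sort by descending score; ties are broken by product id for determinism.
    candidates.sort(key=lambda sp: (-sp[0], sp[1]))
    return [p for _, p in candidates[:n]]


example_scores = {"a": 0.2, "b": 0.9, "c": 0.5}
top_two = recommend_top_n(example_scores, 2)
```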
The structure of AARM is shown in Figure 1. In the input layer, users and products are represented as binarized sparse vectors using the one-hot encoding method. Above the input layer, the Aspect Interactions part is used to model the interactions between the aspects from user $u$’s aspect set $\mathcal{A}_u$ and the aspects from product $p$’s aspect set $\mathcal{A}_p$. Because a user’s review of a product may not cover all the factors which can influence the user’s satisfaction with the product, the aspects extracted from review text may not be able to fully explain the rating. Hence the Global Interactions part is stacked above the input layer to model the implicit factors which influence the user’s decision but have not been discussed in the reviews. Finally, the results of the aforementioned two parts are concatenated as the input to the Output Layer.
3.2. Aspect Interactions Part
In the Aspect Interactions part, given a user $u$ and a product $p$, aspects are first extracted from their reviews and used to construct their aspect sets $\mathcal{A}_u$ and $\mathcal{A}_p$, respectively. To model the similarity between aspects, instead of one-hot encoding or a bag-of-words model, embedding layers are used in AARM to represent aspects as continuous vectors. Specifically, an aspect embedding matrix $E \in \mathbb{R}^{d \times |\mathcal{A}|}$ is defined to project aspects from $\mathcal{A}_u$ and $\mathcal{A}_p$ to $F_u \in \mathbb{R}^{d \times m}$ and $F_p \in \mathbb{R}^{d \times n}$, respectively, where $d$ is the dimension of aspect embeddings, and $m$ and $n$ are respectively the number of aspects in $\mathcal{A}_u$ and $\mathcal{A}_p$. The $i$th aspect in $\mathcal{A}_u$ is projected to $f_i^u$, which is the $i$th column of $F_u$. Similarly, aspects in $\mathcal{A}_p$ are projected to the embedding vectors in $F_p$. Next, the Attentive Aspect-Interaction Pooling Module is designed to model the bi-interactions between the aspect embeddings of $\mathcal{A}_u$ and those of $\mathcal{A}_p$, and outputs a vector to represent the preference information in user reviews.
3.2.1. Aspect Extraction
Because the main contribution of this paper is how to leverage aspects for personalized recommendation, we rely on external tools for aspect extraction. In this paper, we use Sentires (http://yongfeng.me/software/), which has been successfully used in (Zhang et al., 2014a, b), for aspect extraction. Other aspect extraction tools can also be applied. This toolkit extracts aspects via a hybrid of rule-based and machine learning algorithms. Given a dataset, it generates an aspect lexicon, which is used to build the aspect set $\mathcal{A}$ of the dataset in this paper. With this toolkit, we obtain the user aspect set $\mathcal{A}_u$ for each user $u$, and the product aspect set $\mathcal{A}_p$ for each product $p$, by extracting the mentioned aspects from their reviews. Some examples of the automatically extracted aspects are shown in Table 3.
Note that the size of the aspect set varies across users and products. To accelerate the training of AARM, we pad all the user aspect sets to the same length $m$ and all the product aspect sets to the same length $n$. Taking the user aspect set $\mathcal{A}_u$ as an example, we define a meaningless aspect $a_0$ and append it to the user aspect sets whose lengths are less than the predefined size $m$. For $\mathcal{A}_u$ whose length is larger than $m$, we calculate the TF-IDF score (Salton et al., 1975) of each $a \in \mathcal{A}_u$, and truncate $\mathcal{A}_u$ to $m$ aspects by dropping the aspects with low TF-IDF scores. The TF-IDF score is defined as:
$$\mathrm{tfidf}_{u,a} = \mathrm{tf}_{u,a} \cdot \log \frac{|\mathcal{U}|}{|\{u' \in \mathcal{U} : a \in \mathcal{A}_{u'}\}|} \tag{1}$$
where $\mathrm{tf}_{u,a}$ is the frequency of $a$’s occurrence in $u$’s reviews, $|\mathcal{U}|$ is the number of users, and $|\{u' \in \mathcal{U} : a \in \mathcal{A}_{u'}\}|$ is the number of users who mentioned $a$. All the product aspect sets are padded to the same length $n$ in a similar way.
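The padding and truncation procedure can be sketched as follows. This is a minimal illustration, assuming the standard TF-IDF form (term frequency times the logarithm of the number of users divided by the number of users who mentioned the aspect); the padding token `PAD` and both function names are hypothetical, and the user's own aspect set is assumed to be included in `all_user_aspect_sets`.

```python
import math
from collections import Counter

PAD = "<pad>"  # hypothetical name for the meaningless padding aspect


def tfidf_scores(user_mentions, all_user_aspect_sets):
    """TF-IDF score of every aspect mentioned by one user.

    user_mentions:        list of aspect mentions (with repeats) in the user's reviews.
    all_user_aspect_sets: list with one set of aspects per user (this user included).
    """
    n_users = len(all_user_aspect_sets)
    scores = {}
    for aspect, freq in Counter(user_mentions).items():
        n_mentioning = sum(1 for s in all_user_aspect_sets if aspect in s)
        scores[aspect] = freq * math.log(n_users / n_mentioning)
    return scores


def pad_or_truncate(user_mentions, all_user_aspect_sets, length):
    """Keep the `length` highest-scoring aspects, padding with PAD if too few."""
    scores = tfidf_scores(user_mentions, all_user_aspect_sets)
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    kept = ranked[:length]
    return kept + [PAD] * (length - len(kept))


users = [{"battery life", "screen"}, {"screen"}, {"screen", "price"}]
mentions = ["battery life", "battery life", "screen"]
short = pad_or_truncate(mentions, users, 1)
long = pad_or_truncate(mentions, users, 3)
```

In this toy example "screen" is mentioned by every user, so its score is zero and it is the first aspect to be dropped when truncating.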
3.2.2. Attentive Aspect-Interaction Pooling Module
As shown in Figure 2, given $\mathcal{A}_u$ and $\mathcal{A}_p$ as input, there are four parts in this module: aspect embedding transformation, aspect interaction layer, aspect-level attentive pooling layer, and user-level attentive pooling layer. The final output of this module is the vector $z^{A}_{u,p}$, which represents the overall satisfaction of a user towards a product as estimated from review text. In this module, we hold the assumption that $u$’s overall satisfaction with $p$ is based on $p$’s performance on the aspects that $u$ is concerned with (i.e., aspects from $\mathcal{A}_u$). This module works as follows. First, for each aspect $a_i \in \mathcal{A}_u$, the aspect interaction layer and aspect-level attentive pooling layer are employed to estimate $p$’s performance on $a_i$, where the performance is represented as a vector $g_i$. Then the user-level attentive pooling layer is used to estimate $u$’s preference towards $p$ by integrating $g_i$ for all the aspects $a_i \in \mathcal{A}_u$, and the preference is represented as a vector $z^{A}_{u,p}$. Finally, $z^{A}_{u,p}$ is combined with the result of the Global Interactions part and fed into the output layer to estimate user $u$’s satisfaction score towards product $p$.
Aspect Embedding Transformation
To model the interactions between synonymous and related aspects, we expect the vector representation of an aspect to encode the similarity relation between aspects. In this paper, the Word2vec model (Mikolov et al., 2013), which is able to encode many linguistic regularities and patterns, is used to pre-train aspect embeddings with the review texts in each dataset. The aspect embedding matrix $E$ is initialized with the pre-trained embeddings and its parameters are not tuned during the training of AARM. Instead, a trainable matrix $W_t \in \mathbb{R}^{k \times d}$ is defined to customize the pre-trained aspect embedding $f$ (a column vector in $F_u$ or $F_p$) to make it oriented towards our recommendation task. Then these customized embeddings are normalized as:
$$\hat{f} = \frac{W_t f}{\lVert W_t f \rVert} \tag{2}$$
Here $\lVert x \rVert$ is the Euclidean norm of $x$. In this paper, the aspect interaction between two aspects is defined as the element-wise product of their embedding vectors. By normalizing the aspect embeddings with their corresponding Euclidean norms, the calculation of the interaction between two aspects becomes similar to calculating their cosine similarity: the elements of the interaction vector sum to exactly the cosine similarity of the two customized embeddings. As illustrated in (Mikolov et al., 2013), if two words have higher semantic and syntactic similarity, their embeddings generated by Word2vec have a larger cosine similarity. In this way, the results of aspect interactions are associated with the semantic and syntactic relations between aspects, which helps in identifying synonymous and related aspects. Alternatively, we can also directly tune the aspect embedding matrix $E$ during the training of AARM for top-N recommendation. We will compare the performance of these different settings in the experiment section.
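The link between normalized element-wise products and cosine similarity can be checked numerically. The sketch below uses NumPy with randomly generated stand-ins for the pre-trained embeddings and the trainable matrix; the names `transform_and_normalize`, `F` and `W` and all dimensions are illustrative assumptions.

```python
import numpy as np


def transform_and_normalize(F, W):
    """Apply a linear transformation to each embedding and rescale to unit norm.

    F: (d, n) matrix whose columns are pre-trained aspect embeddings.
    W: (k, d) trainable transformation matrix.
    """
    Z = W @ F
    norms = np.linalg.norm(Z, axis=0, keepdims=True)
    return Z / np.maximum(norms, 1e-12)  # guard against division by zero


rng = np.random.default_rng(0)
F = rng.normal(size=(4, 3))   # three toy aspects with d = 4
W = rng.normal(size=(5, 4))   # toy transformation with k = 5
F_hat = transform_and_normalize(F, W)

# Interaction vector of aspects 0 and 1: element-wise product.
interaction = F_hat[:, 0] * F_hat[:, 1]

# Cosine similarity of the transformed (unnormalized) embeddings.
z0, z1 = W @ F[:, 0], W @ F[:, 1]
cosine = z0 @ z1 / (np.linalg.norm(z0) * np.linalg.norm(z1))
```

Summing the entries of `interaction` reproduces `cosine`, which is why the interaction vector can be read as a per-dimension decomposition of the cosine similarity.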
Aspect Interaction Layer
This layer maps the vector representations of aspects in $\mathcal{A}_u$ and $\mathcal{A}_p$ to a set of $k$-dimensional interacted vectors. The aspect interaction between aspects $a_i \in \mathcal{A}_u$ and $a_j \in \mathcal{A}_p$ is defined as the element-wise product of their normalized embedding vectors $\hat{f}_i^u$ and $\hat{f}_j^p$. Hence the output of the aspect interaction layer can be represented as a set of vectors:
$$\left\{ h_{ij} = \delta_{a_i} \delta_{a_j} \left( \hat{f}_i^u \odot \hat{f}_j^p \right) \;\middle|\; a_i \in \mathcal{A}_u,\ a_j \in \mathcal{A}_p \right\} \tag{3}$$
Here $\delta_a$ is the masking indicator of aspect $a$, where $\delta_a = 0$ if $a$ is the meaningless aspect $a_0$ (defined for padding) and $\delta_a = 1$ otherwise. To implement the masking operation in AARM, we define an aspect masking matrix $M$, where the column of aspect $a_0$ is a zero vector, and the columns of the other aspects in $\mathcal{A}$ are vectors of ones. Before calculating the interactions between an aspect $a_i$ and the aspects in $\mathcal{A}_p$, we first calculate the element-wise product between $\hat{f}_i^u$ and its corresponding column in $M$. After the masking operation, the embedding vector of aspect $a_0$ is transformed into a zero vector. In this way, we make sure that the interactions between aspect $a_0$ and other aspects are zero vectors. As shown in the following sections, these zero vectors do not influence AARM’s final predictions.
As shown in Equation (3), besides the same aspects, the interactions between different aspects (when $a_i \neq a_j$) are also calculated. This is because we want to model the interactions between synonymous and similar aspects to alleviate the problem that the aspects shared by a user’s reviews and a product’s reviews are usually very sparse. However, interactions between unrelated aspects are also included in Equation (3). To emphasize interactions between related aspects and filter out noisy interactions, the aspect-level attentive pooling layer is stacked above this layer.
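A minimal NumPy sketch of the interaction layer with padding masks is given below. It represents the masking as 0/1 vectors over the padded aspect positions rather than as a lookup into a masking matrix over the whole aspect vocabulary, which is an implementation assumption; the function name and argument names are hypothetical.

```python
import numpy as np


def aspect_interactions(Fu_hat, Fp_hat, mask_u, mask_p):
    """Pairwise element-wise products between user and product aspect embeddings.

    Fu_hat: (k, m) normalized embeddings of the user's (padded) aspects.
    Fp_hat: (k, n) normalized embeddings of the product's (padded) aspects.
    mask_u: (m,) array with 0 at padding positions and 1 elsewhere.
    mask_p: (n,) array with 0 at padding positions and 1 elsewhere.
    Returns an (m, n, k) tensor; entry [i, j] is the interaction of pair (i, j).
    """
    Fu = Fu_hat * mask_u          # zero out the columns of padding aspects
    Fp = Fp_hat * mask_p
    return Fu.T[:, None, :] * Fp.T[None, :, :]


Fu_hat = np.ones((2, 3))
Fp_hat = 2.0 * np.ones((2, 2))
inter = aspect_interactions(Fu_hat, Fp_hat,
                            mask_u=np.array([1, 1, 0]),
                            mask_p=np.array([1, 0]))
```

Every pair that involves a padding position yields a zero vector, while all other pairs, including pairs of different aspects, yield their element-wise product.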
Aspect-level Attentive Pooling Layer
In the aspect interaction layer, for each aspect $a_i \in \mathcal{A}_u$, we calculate its interactions with all the aspects in $\mathcal{A}_p$. Intuitively, some aspect interactions should be given more attention than others. For example, the interactions between the same, synonymous or similar aspects usually contain more information about the product’s performance on the aspects that the user is concerned with. Hence an attention module is designed to focus on important aspect interactions. Word2vec embeddings of similar words have higher cosine similarities (Mikolov et al., 2013). Inspired by this, for an aspect pair $a_i \in \mathcal{A}_u$ and $a_j \in \mathcal{A}_p$, the input of the attention layer is defined as the element-wise product of their normalized embedding vectors $\hat{f}_i^u$ and $\hat{f}_j^p$, i.e., the interaction vector $h_{ij}$ of Equation (3), to mimic the cosine similarity between their embedding vectors. The aspect-level attention layer is defined as:
$$\alpha_{ij} = \frac{\exp\left( v_\alpha^{\top} h_{ij} \right)}{\sum_{j'=1}^{n} \exp\left( v_\alpha^{\top} h_{ij'} \right)} \tag{4}$$
Here $v_\alpha$ is a learnable vector, and $\alpha_{ij}$ is the attention value of the interaction between aspects $a_i$ and $a_j$.
To estimate the product’s performance on the user’s aspect $a_i$, we compress all the interactions between $a_i$ and the aspects in $\mathcal{A}_p$ with a weighted sum pooling where $\alpha_{ij}$ is used as the weight:
$$g_i = \sum_{j=1}^{n} \alpha_{ij} h_{ij} \tag{5}$$
The output of this layer is the vector set $\{g_i\}_{i=1}^{m}$.
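The aspect-level attentive pooling can be sketched in a few lines of NumPy. The sketch assumes that attention scores are a dot product between a learnable vector and each interaction vector, normalized with a softmax over the product's aspects; the softmax normalization and all names (`aspect_level_pooling`, `v_alpha`) are assumptions made for illustration rather than the authors' exact formulation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def aspect_level_pooling(interactions, v_alpha):
    """Attention-weighted sum of interactions for each user aspect.

    interactions: (m, n, k) tensor of pairwise interaction vectors.
    v_alpha:      (k,) learnable attention vector (assumed scoring form).
    Returns the pooled vectors (m, k) and the attention weights (m, n).
    """
    scores = interactions @ v_alpha                 # (m, n) raw attention scores
    alpha = softmax(scores, axis=1)                 # normalize over product aspects
    pooled = (alpha[:, :, None] * interactions).sum(axis=1)
    return pooled, alpha


rng = np.random.default_rng(0)
inter = rng.normal(size=(3, 4, 5))
G, alpha = aspect_level_pooling(inter, rng.normal(size=5))
```

With an all-zero attention vector the weights become uniform and the layer reduces to average pooling, which gives a simple sanity check.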
User-level Attentive Pooling Layer
We can integrate the vector set $\{g_i\}_{i=1}^{m}$ produced by the aspect-level attentive pooling layer, which represents how the product fits the user’s requirements on each aspect, to produce an estimate of the user’s overall satisfaction with this product. Intuitively, different users may focus on different aspects even when purchasing the same product. For example, when purchasing a cell phone, some users are more concerned about battery duration while others are more concerned about the performance of the CPU. Furthermore, when purchasing different products, the product features a user is most concerned with may be different. In other words, a user’s attention towards an aspect when purchasing a specific product is influenced by the characteristics of the user, the aspect and the product simultaneously.
To estimate user $u$’s interest in aspect $a_i \in \mathcal{A}_u$ when purchasing a specific product $p$, a user-level attentive pooling layer is designed in AARM. The input of this attention layer should contain not only information about the current aspect $a_i$, but also information about product $p$. Intuitively, if an aspect is more important to product $p$, the user should pay more attention to this aspect than to other unrelated aspects in $\mathcal{A}_u$. The importance of the user’s aspect $a_i$ with respect to product $p$ can be measured by the similarities between $a_i$ and the aspects that have been mentioned in $p$’s reviews (i.e., aspects from $\mathcal{A}_p$). To calculate the importance of aspect $a_i$ with respect to product $p$, the interactions $h_{ij}$ between $a_i$ and all the aspects $a_j \in \mathcal{A}_p$ are calculated and summed up:
$$s_i = \sum_{j=1}^{n} h_{ij} \tag{6}$$
As the interaction between two aspects represents their similarity, $s_i$ represents the overall similarity between the aspect $a_i$ and the product $p$. To measure the importance of the different aspects $a_i \in \mathcal{A}_u$, $s_i$ is used as aspect $a_i$’s input to the user-level attention layer. The attention layer is defined as:
$$\beta_i = \frac{\exp\left( v_\beta^{\top} s_i \right)}{\sum_{i'=1}^{m} \exp\left( v_\beta^{\top} s_{i'} \right)} \tag{7}$$
Here $v_\beta$ is a learnable vector, and $\beta_i$ represents the importance of aspect $a_i$ in user $u$’s preferences with regard to product $p$. This attention layer is different from the aspect-level attention layer defined in Equation (4), as $v_\beta$ and $v_\alpha$ are two different vectors.
Finally, we compress the vector set $\{g_i\}_{i=1}^{m}$ with a weighted sum pooling to generate a vector which represents user $u$’s overall satisfaction towards product $p$:
$$z^{A}_{u,p} = \sum_{i=1}^{m} \beta_i g_i \tag{8}$$
Here $z^{A}_{u,p}$ is the output of the Aspect Interactions part.
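The user-level attentive pooling can be sketched in the same style. As before, the dot-product scoring with a softmax over the user's aspects is an assumed form, and the names `user_level_pooling` and `v_beta` are hypothetical; the inputs are the pairwise interaction tensor and the per-aspect vectors produced by the aspect-level pooling.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def user_level_pooling(interactions, pooled, v_beta):
    """Combine per-aspect vectors into one preference vector.

    interactions: (m, n, k) pairwise interaction vectors.
    pooled:       (m, k) output of the aspect-level attentive pooling.
    v_beta:       (k,) learnable attention vector (assumed scoring form).
    Returns the preference vector (k,) and the aspect weights (m,).
    """
    s = interactions.sum(axis=1)       # (m, k): similarity of each aspect to the product
    beta = softmax(s @ v_beta)         # (m,): importance of each user aspect
    return beta @ pooled, beta


rng = np.random.default_rng(1)
inter = rng.normal(size=(3, 4, 5))
pooled = rng.normal(size=(3, 5))
z_aspect, beta = user_level_pooling(inter, pooled, rng.normal(size=5))
```

Because the attention input is built from the product's aspects, the same user aspect can receive different weights for different products.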
3.3. Global Interactions Part
To model the implicit factors which are not mentioned in review text but influence users’ satisfaction, AARM assigns a latent factor vector to every user and every product. In this module, the embedding matrix $E_U \in \mathbb{R}^{l \times |\mathcal{U}|}$ is defined to project user $u$ to $e_u \in \mathbb{R}^{l}$, and the embedding matrix $E_P \in \mathbb{R}^{l \times |\mathcal{P}|}$ is defined to project product $p$ to $e_p \in \mathbb{R}^{l}$. These two embedding matrices are randomly initialized and tuned during the training for top-N recommendation. Then the global interaction between user $u$ and product $p$ is calculated in a way similar to that in vanilla latent factor models:
$$z^{G}_{u,p} = e_u \odot e_p \tag{9}$$
Here $z^{G}_{u,p}$ is the output of this part.
3.4. Output Layer
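To merge information from the aforementioned two modules, the output $z^{A}_{u,p}$ of the Aspect Interactions part and the output $z^{G}_{u,p}$ of the Global Interactions part are concatenated into one vector, and a regression layer without an activation function is stacked above it:
$$\hat{y}_{u,p} = W_o \begin{bmatrix} z^{A}_{u,p} \\ z^{G}_{u,p} \end{bmatrix} \tag{10}$$
Here $W_o$ belongs to $\mathbb{R}^{1 \times (k + l)}$, where $k$ and $l$ are the dimensions of $z^{A}_{u,p}$ and $z^{G}_{u,p}$, and $\hat{y}_{u,p}$ represents user $u$’s overall satisfaction score towards product $p$.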
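The global interaction and the output layer can be illustrated together. The sketch assumes the global interaction is the element-wise product of the user and product latent vectors and that the regression layer has no bias term; both are assumptions, as are the names used.

```python
import numpy as np


def satisfaction_score(e_u, e_p, z_aspect, w_out):
    """Linear regression over the concatenated aspect and global vectors.

    e_u, e_p: latent vectors of the user and the product (same length).
    z_aspect: output vector of the aspect interactions part.
    w_out:    weights of the regression layer, one per concatenated feature.
    """
    z_global = e_u * e_p                              # element-wise product (assumed)
    features = np.concatenate([z_aspect, z_global])   # merge both parts
    return float(w_out @ features)                    # no activation, no bias (assumed)


score = satisfaction_score(
    e_u=np.array([1.0, 2.0]),
    e_p=np.array([3.0, 4.0]),
    z_aspect=np.array([0.5]),
    w_out=np.array([2.0, 1.0, 1.0]),
)
```

In this toy call the concatenated feature vector is `[0.5, 3.0, 8.0]`, so the score is `2 * 0.5 + 3 + 8 = 12`.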
In this paper, we binarize the rating scores and train AARM with a learning-to-rank method. Ranking methods are widely used in information retrieval (Liu, 2009; Luo et al., 2018; Hong et al., 2017) and recommendation models (Pan et al., 2019; He et al., 2017). In AARM, we use Bayesian Personalized Ranking (BPR), which is a pair-wise method. This makes AARM suitable for recommendation with implicit feedback. Given a user $u$, a triple ($u$, $p$, $p'$) is constructed for pair-wise training. Here, $p$ refers to a product that $u$ has purchased, while $p'$ refers to an unpurchased one. During training, the positive user-product pair ($u$, $p$) is drawn from the rating set $\mathcal{R}$ and is accompanied by one negative pair ($u$, $p'$), where $p'$ is randomly sampled from $u$’s unpurchased products. Intuitively, AARM should give a higher satisfaction score to the positive pair ($u$, $p$) than to the negative pair ($u$, $p'$). Hence, the BPR optimization criterion is employed as the objective function of AARM:
$$L_{BPR} = -\frac{1}{N} \sum_{(u, p, p')} \ln \sigma\left( \hat{y}_{u,p} - \hat{y}_{u,p'} \right) \tag{11}$$
Here $\sigma$ refers to the sigmoid function, and $N$ is the number of positive pairs ($u$, $p$) in $\mathcal{R}$.
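The pair-wise training setup can be sketched as follows: one negative product is sampled for every positive pair, and the loss averages the negative log-sigmoid of the score differences. The function names, the dictionary-based data layout and the averaging over pairs are illustrative assumptions; each user is assumed to have at least one unpurchased product.

```python
import numpy as np


def sample_triples(purchases, all_products, rng):
    """Build one (user, positive, negative) triple per purchased product.

    purchases:    dict mapping user id -> set of purchased product ids.
    all_products: set of all product ids.
    rng:          numpy random Generator used for negative sampling.
    """
    triples = []
    for user, bought in purchases.items():
        candidates = sorted(all_products - bought)   # unpurchased products
        for pos in sorted(bought):
            neg = candidates[int(rng.integers(len(candidates)))]
            triples.append((user, pos, neg))
    return triples


def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)), computed in a numerically stable way."""
    diff = np.asarray(pos_scores) - np.asarray(neg_scores)
    # -log(sigmoid(x)) equals log(1 + exp(-x)) = logaddexp(0, -x).
    return float(np.mean(np.logaddexp(0.0, -diff)))


rng = np.random.default_rng(0)
purchases = {"u1": {"p1", "p2"}, "u2": {"p3"}}
products = {"p1", "p2", "p3", "p4"}
triples = sample_triples(purchases, products, rng)
```

The loss shrinks as positive pairs are scored further above their sampled negatives, and equals `log(2)` when both scores are identical.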
To prevent possible overfitting, L2 regularization is applied to the user embedding matrix, the product embedding matrix and the kernel matrix of the output layer. As shown in Equation (12), to implement the regularization, we first calculate the mean of the element-wise squares of each of these three matrices. The results are then multiplied by the regularization coefficient and added to the loss function. For a matrix W, the mean of the element-wise squares is the squared L2-norm of W divided by the number of elements in W, and the regularization coefficient controls the regularization strength. We minimize the loss function to fit AARM to the data.
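A minimal sketch of this regularization term follows, with matrices represented as nested Python lists. The helper names `mean_square` and `regularized_loss` are illustrative assumptions, and the sketch assumes that the three mean-square terms are simply summed before being scaled by the coefficient.

```python
def mean_square(matrix):
    """Mean of the element-wise squares of a matrix given as a list of rows."""
    values = [v for row in matrix for v in row]
    return sum(v * v for v in values) / len(values)


def regularized_loss(base_loss, matrices, coeff):
    """Add coeff * (sum of per-matrix mean squares) to the ranking loss."""
    penalty = sum(mean_square(m) for m in matrices)
    return base_loss + coeff * penalty
```

Dividing by the number of elements keeps the penalty on a comparable scale for matrices of very different sizes, such as a large embedding matrix and a small output kernel.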
Besides regularization, we also use dropout (Srivastava et al., 2014) to reduce overfitting. Dropout prevents complex co-adaptations on training data by randomly dropping some units during training (Srivastava et al., 2014). Dropout is applied to the output of the Global Interactions module and the output of the Aspect Interactions module.
Aspect Embedding Pre-training. In our experiments, gensim’s implementation of Word2vec (https://radimrehurek.com/gensim/) is used to train the aspect embeddings. Before training embeddings with Word2vec, we first construct a dictionary for every dataset and then segment the reviews of each dataset into lists of words or phrases according to this dictionary. All the aspects (in the form of words or phrases) of each dataset are added to the corresponding dictionary to make sure that the Word2vec tool can recognize all the aspects and train embedding vectors for them. For each dataset, all the reviews in the training set are used to train the aspect embeddings. These embedding vectors are used as the initial values of the aspect embedding matrix, which is not tuned during the training for top-N recommendation.
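The segmentation step can be sketched as follows. The text does not state which segmentation algorithm is used, so the greedy longest-match strategy, the lowercasing, the whitespace tokenization and the function name `segment` are all assumptions made for illustration.

```python
def segment(review, phrases, max_len=3):
    """Split a review into words, merging dictionary phrases into one token.

    `phrases` is a set of multi-word aspects; `max_len` is the longest
    phrase length (in words) that is tried. Both are illustrative choices.
    """
    words = review.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first and fall back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in phrases:
                tokens.append(candidate)
                i += n
                break
    return tokens
```

Token lists of this kind are what gensim’s `Word2Vec` accepts as its `sentences` argument, so each multi-word aspect receives its own embedding vector.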
In this section, we design experiments to study the following research questions:
RQ1 Can AARM outperform state-of-the-art methods on top-N recommendation task?
RQ2 Can the interactions between different aspects improve the performance of AARM?
RQ3 Can the modeling of varied user interests improve the performance of AARM?
RQ4 How does the initialization and tuning strategy of aspect embedding influence the performance of AARM?
RQ5 What are the contributions of the Global Interaction part and Aspect Interaction part in the overall performance of AARM?
In the rest of this section, we will first introduce experimental settings, and then successively answer the above research questions with not only quantitative experiments but also qualitative case studies.
|Dataset||#Rating||#User||#Product||Sparsity|
|Movies and TV||1,697,533||123,960||50,052||0.0274%|
|CDs and Vinyl||1,097,592||75,258||64,421||0.0226%|
|Clothing, Shoes and Jewelry||278,677||39,387||23,033||0.0307%|
|Cell Phones and Accessories||194,439||27,879||10,429||0.0669%|
We use the “5-core” subsets of the publicly accessible “Amazon product dataset” (http://jmcauley.ucsd.edu/data/amazon/) (He and McAuley, 2016) for experiments. Here “5-core” means that each user and product in the subset has at least five reviews. Each record in the dataset is composed of five variables: user, product, rating, textual review and helpfulness votes. In AARM, we only use user, product and textual review. To follow the setting of the baseline methods, in our pair-wise learning-to-rank framework, ratings are binarized to construct positive user-product pairs. We adopt five different product categories from the “Amazon product dataset”, i.e., ‘Movies and TV’, ‘CDs and Vinyl’, ‘Clothing, Shoes and Jewelry’, ‘Cell Phones and Accessories’ and ‘Beauty’. Detailed statistics of the five datasets, including the sparsity and the number of ratings (#Rating), users (#User) and products (#Product), are summarized in Table 1. Sparsity is defined as #Rating / (#User × #Product). We can see that the five datasets have different sizes and different levels of sparsity, and thus cover different recommendation scenarios.
For each user, 70% of its records are randomly selected as the training set, while the remaining 30% are put into the test set. In particular, we use exactly the same splits and evaluation measures as the experimental settings in (Zhang et al., 2017); we would like to thank the authors for sharing the datasets and specific splits with us. This guarantees that all the methods are evaluated under exactly the same settings for fair comparison.
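A per-user random split of this kind can be sketched as below. The experiments reuse the splits of (Zhang et al., 2017), so this is only an illustration of the protocol; the function name `split_per_user`, the rounding rule and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def split_per_user(records, train_ratio=0.7, seed=0):
    """Split (user, product) records so each user keeps ~70% for training."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, product in records:
        by_user[user].append(product)

    train, test = [], []
    for user in sorted(by_user):
        products = by_user[user][:]
        rng.shuffle(products)
        # Rounding to the nearest integer is an assumed convention.
        cut = round(train_ratio * len(products))
        train += [(user, p) for p in products[:cut]]
        test += [(user, p) for p in products[cut:]]
    return train, test
```

Splitting inside each user, rather than over the whole record list, ensures that every user appears in both the training set and the test set.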
4.2. Aspects from User Reviews
Detailed statistics of the aspects extracted from user reviews by Sentires are shown in Table 2. We can see that the number of aspects (#Aspect), the average number of aspects per user (Ave. #Aspect/User) and the average number of aspects per product (Ave. #Aspect/Product) vary across the five datasets, which makes our experiments more comprehensive.
|Dataset||#Aspect||Ave. #Aspect/User||Ave. #Aspect/Product|
|Movies and TV||2865||14.72||32.24|
|CDs and Vinyl||4033||31.04||41.31|
|Clothing, Shoes and Jewelry||525||7.04||9.77|
|Cell Phones and Accessories||648||6.93||12.50|
Table 3 shows some examples of the aspects extracted from each dataset. We did not conduct any post-processing on the extracted aspects. Although there are some noise words in the aspect collection, Sentires is largely effective in extracting meaningful aspects that correspond to important product features. There are also synonymous aspects like “songwriters” and “composers”, and related aspects like “smell” and “chocolate smell”, which are usually treated as disparate product features in most existing aspect-level models.
|Movies and TV||3d movie, cast, halloween film, halloween movie, harden, melodrama, movie star, screen time, thrillers, zombie movie|
|CDs and Vinyl||1980s, band, crooners, crooning, country musics, fingerwork, singers, rock fans, songwriters, composers|
|Clothing, Shoes and Jewelry||color, cottony, diamonds, fit, price, presentation box, sleeve shirts, sleeve, traction, torso|
|Cell Phones and Accessories||usb, accessory, a little, car chargers, car speaker, charge cycle, charge cycles, looks, plastic, quality|
|Beauty||results, smell, chocolate smell, odor, ingredient, ingredients, face feeling, hair feeling, sheen, shampoos|
4.3. Evaluation Protocols
To generate a top-N recommendation list for a user, a model first estimates the scores of the user’s candidate products, then ranks all the candidate products according to the scores and truncates the ranking list at N. In this paper, a user’s candidate products include all the products in the user’s test set and those that have not been purchased by the user. In the evaluation, the products in the user’s test set are used as ground truth. The value of N follows the settings in (Zhang et al., 2017). Four standard metrics are used in the evaluation: Recall, Precision, Normalized Discounted Cumulative Gain (NDCG) and Hit Ratio (HT).
Recall is the percentage of the products purchased by the user that have been recommended to the user, that is, the number of ground truth products in the recommendation list divided by the number of ground truth products. We average the measure across all testing users.
Precision is the percentage of products in the top-N recommendation list that have been purchased by the user, that is, the number of ground truth products in the recommendation list divided by N. The measure is averaged across all testing users.
NDCG takes the positions of the purchased products in the recommendation list into account. It is based on the Discounted Cumulative Gain (DCG), which accumulates the graded relevance of the product at each position of a user’s recommendation list, with a discount that grows with the position.
The NDCG of a user is the DCG of the user’s recommendation list divided by IDCG, where IDCG is the DCG of the ideal recommendation list in which the user’s ground truth products are all ranked at the top. We average NDCG across all testing users.
HT is the number of users who have purchased at least one product in their recommendation lists, divided by the number of testing users.
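The four measures can be sketched in a few lines. The sketch assumes binary relevance (a product is relevant if it is in the user’s ground truth set) and a logarithmic position discount of 1 / log2(position + 1), which is a common DCG convention rather than one confirmed by the text; all function names are illustrative.

```python
import math


def recall_at_n(ranked, truth, n):
    hits = sum(1 for p in ranked[:n] if p in truth)
    return hits / len(truth)


def precision_at_n(ranked, truth, n):
    hits = sum(1 for p in ranked[:n] if p in truth)
    return hits / n


def ndcg_at_n(ranked, truth, n):
    # enumerate() is 0-based, so position k (1-based) has discount log2(k + 1).
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, p in enumerate(ranked[:n]) if p in truth)
    # Ideal list: all ground truth products ranked at the top.
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(truth), n)))
    return dcg / idcg


def hit_ratio(rankings, truths, n):
    """Fraction of users with at least one ground truth product in the top n."""
    users_hit = sum(1 for u in rankings
                    if any(p in truths[u] for p in rankings[u][:n]))
    return users_hit / len(rankings)
```

Recall, Precision and NDCG are computed per user and then averaged over the testing users, whereas HT is already a ratio over users.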
4.4. Baselines and Parameter Settings
We compare our method AARM with the following baselines.
BPR-MF (Rendle et al., 2009). The matrix factorization (MF) based on Bayesian Personalized Ranking (BPR), which combines MF-model with a pair-wise learning to rank loss function, is a solid baseline for top-N recommendation. Only user-product interaction data is used in this method.
BPR-HFT (McAuley and Leskovec, 2013). The Hidden Factor and Topics (HFT) model associates topics extracted from reviews with latent factors learned from numerical ratings. It is one of the state-of-the-art review-based recommendation methods. The original HFT model is a rating prediction method. BPR-HFT (Zhang et al., 2017) modifies HFT by adding a Bayesian Personalized Ranking loss on top of HFT to generate the top-N recommendation.
GMF (He et al., 2017). Generalized Matrix Factorization (GMF) is one of the state-of-the-art neural network based recommendation methods which only utilizes user-product interaction records. In experiments, we directly use the code released by the authors (https://github.com/hexiangnan/neural_collaborative_filtering).
BPR-AFM (Xiao et al., 2017). Attentional Factorization Machine (AFM) is an improved variant of the famous factorization machine (FM) (Rendle, 2010). Similar to our method, AFM uses a neural attention network to discriminate the importance of different feature interactions. The original version of AFM is designed for regression task and optimizes the squared loss. We modified AFM by adding a Bayesian Personalized Ranking loss on top of AFM to generate the top-N recommendation. Given a user and an item as input, we use the user identity, the item identity, the user’s aspects and the item’s aspects as features. Both the identity features and aspect features have corresponding embedding vectors in the model, which are randomly initialized and then fine-tuned during the training.
DeepCoNN (Zheng et al., 2017). The Deep Cooperative Neural Network is one of the state-of-the-art deep learning methods for recommendation which utilizes reviews to build user and product representations. It uses the review-based user and product representations for rating prediction.
JRL (Zhang et al., 2017). The Joint Representation Learning model is a state-of-the-art method which integrates different information sources with deep learning methods for top-N recommendation. Textual reviews, product images and numerical ratings are jointly used in JRL.
JRL-Review (Zhang et al., 2017). JRL-Review is a single-view version of JRL which incorporates textual reviews for top-N recommendation. JRL-Review employs PV-DBOW model (Le and Mikolov, 2014) to learn the vector representations of users and products from their corresponding reviews. It is one of the state-of-the-art review-based recommendation methods.
eJRL (Zhang et al., 2017). eJRL is another variant of JRL which jointly utilizes textual reviews, product images and numerical ratings for recommendation. The difference between them is that eJRL prevents information propagation among different information sources.
The hyper-parameters of the baselines are tuned on the training set with five-fold cross-validation. In particular, the dimension of latent factors (or embeddings) for BPR-MF, BPR-HFT and DeepCoNN is 100. For BPR-HFT, the number of topics is 10. For JRL, JRL-Review and eJRL, the embedding size is set to 300. For GMF and BPR-AFM, the size of all the embedding vectors is set to 128.
We implemented our method with TensorFlow (https://www.tensorflow.org/). When padding user aspect sets to the same size, the maximum size was defined as the 75% quantile of the sizes of all user aspect sets. Similarly, the maximum size of a product aspect set was defined as the 75% quantile of the sizes of all product aspect sets. For the embedding layers, we set the dimension of the user and product embeddings in the Global Interactions module to 128, and the dimension of the aspect embeddings to 128. AARM was optimized with mini-batch Adam (Kingma and Ba, 2014) because Adam uses adaptive learning rates for parameters with different update frequencies and converges faster than vanilla stochastic gradient descent. We tested learning rates of 0.001, 0.003 and 0.01. For the regularization coefficient, the values 0.0, 0.0001, 0.01 and 0.1 were tested. To prevent overfitting, the dropout rate of the dropout layers was set to 0.5. When pre-training aspect embeddings with Word2vec, the window size and the number of noise words for negative sampling were both set to 5.
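The quantile-based padding can be sketched as follows. The nearest-rank quantile definition, the choice to keep the first aspects when a set is too large, and the padding id `PAD` are assumptions made for illustration; the text only fixes the 75% quantile as the maximum size.

```python
import math

PAD = 0  # hypothetical id of the meaningless aspect used for padding


def quantile_size(sizes, q=0.75):
    """Nearest-rank q-quantile of a list of aspect-set sizes."""
    ordered = sorted(sizes)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def pad_or_truncate(aspects, max_size, pad=PAD):
    """Bring one aspect list to exactly max_size entries."""
    clipped = list(aspects)[:max_size]
    return clipped + [pad] * (max_size - len(clipped))
```

Using a quantile instead of the maximum keeps the padded tensors small when a few users or products have unusually large aspect sets.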
The model was trained for a maximum of 300 epochs with early stopping. To build the validation set, 1000 users are randomly selected from the users in the training set. For each user, one of the user’s purchased products is randomly drawn from the training set as the ground truth product in the validation set. When evaluating the model on the validation set, for each user, all the products that are not paired with the user in the training set are added to the candidate set. Then, to build the recommendation list for each user, the products in the candidate set are ranked according to the estimated matching degrees between them and the user. The aforementioned four measures are used to evaluate the top-N recommendation lists and are then averaged across all the validation users. We evaluate the model on the validation set every 10 epochs. Training is stopped if half of the four measures have decreased for 40 successive epochs.
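One possible reading of this stopping rule is sketched below. It assumes that a measure counts as decreased while it stays at or below its best validation value, and that training stops once at least half of the tracked measures have gone 40 epochs without improving; the class name `EarlyStopper` and this interpretation are assumptions, not the authors’ code.

```python
class EarlyStopper:
    def __init__(self, measures, patience_epochs=40, eval_every=10):
        self.best = {m: float("-inf") for m in measures}
        self.stale = {m: 0 for m in measures}  # epochs since last improvement
        self.patience = patience_epochs
        self.eval_every = eval_every

    def update(self, scores):
        """Record one validation result; return True if training should stop."""
        for name, value in scores.items():
            if value > self.best[name]:
                self.best[name] = value
                self.stale[name] = 0
            else:
                self.stale[name] += self.eval_every
        stalled = sum(1 for s in self.stale.values() if s >= self.patience)
        return stalled * 2 >= len(self.stale)
```

With validation every 10 epochs, a patience of 40 epochs corresponds to four consecutive validation rounds without improvement.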
4.5. Model Comparison (RQ1)
Tables 4 and 5 show the performance of our method and baselines on top-N recommendation task. The performances of rating-based methods (BPR-MF and GMF), review-based methods (BPR-HFT, DeepCoNN, BPR-AFM and JRL-Review), multi-modal methods (eJRL and JRL) and our method (AARM) are shown in the four blocks in each table from top to bottom. The last block of each table also presents the percentage of improvements (or decrements for negative values) achieved by AARM as compared with the best review-based baseline (Impr-JRL-Review) and the best multi-modal baseline (Impr-JRL or Impr-eJRL). The best results are highlighted in bold. As we use the same split as (Zhang et al., 2017), we directly reproduce their results of BPR-MF, BPR-HFT, DeepCoNN, JRL-Review, eJRL and JRL for fair comparisons. From Tables 4 and 5, we can see that:
(1) In general, neural network based methods outperform shallow models (e.g. BPR-MF and BPR-HFT). GMF, which only uses user-product interaction data, even largely outperforms BPR-HFT which incorporates reviews for recommendation. This might be attributed to the powerful representation learning capacity of neural models.
(2) Generally, review-based methods outperform rating-based methods. All the review-based methods outperform BPR-MF. Among neural network based methods, BPR-AFM and JRL-Review also outperform GMF. This shows that reviews are an important information source for boosting recommendation performance.
(3) Our proposed method AARM outperforms all the rating-based methods and review-based methods on all the datasets in terms of different metrics. Compared to these baselines, AARM makes better use of the user-product interaction records and review texts. This is because of AARM’s finer-grained modeling of aspect interactions, which simultaneously considers the interactions between different aspects and the user’s varied attention towards aspects. In the following sections, we further analyze how the specific designs of AARM boost its recommendation performance.
(4) AARM also outperforms both of the multi-modal deep learning methods on all the datasets and on all the measures. It is surprising that our method outperforms these multi-modal deep learning methods, which utilize not only review data but also product image and numerical rating data for recommendation. This further indicates that textual reviews are a very informative information source and that AARM’s finer-grained aspect modeling can effectively leverage reviews for recommendation. In the following sections, we will discuss the contribution of each part of AARM by comparing AARM with its variants.
4.6. Effect of Interactions between Different Aspects (RQ2)
Previous aspect-based methods neglect the interactions between synonymous and similar aspects when making recommendations, and are limited by the sparsity of shared aspects in the reviews of users and products. AARM alleviates this problem by modeling the interactions between different aspects and using an attention module to capture the important aspect interactions. To verify the effect of this design, we compare AARM with two of its variants, termed “A_Inter” and “No-AspectAtt” in Figure 3, under the same experimental settings.
As variants of AARM, A_Inter and No-AspectAtt differ from AARM only in the Aspect Interactions part. Given a user and a product, A_Inter only considers the interactions between the aspects shared by the user and the product. Hence, in the Aspect Interactions part of A_Inter, we first calculate the intersection of the user’s aspect set and the product’s aspect set. To estimate the user’s preference towards the product with respect to an aspect, Equations (3), (4) and (5) of AARM are replaced with a single equation.
This equation uses an indicator to identify the meaningless aspect that is defined for padding. As A_Inter only considers interactions between identical aspects, no aspect-level attention module is used here. In No-AspectAtt, the aspect-level attention layer is removed and the aspect interactions are directly summed up, so Equations (4) and (5) of AARM are replaced with this direct summation.
We evaluate A_Inter and No-AspectAtt on the top-N recommendation task and compare them with AARM in Figure 3. All the experimental settings are kept the same to ensure the reliability of the results. As shown in Figure 3, AARM substantially outperforms A_Inter and No-AspectAtt on all datasets in terms of all measures. Compared to A_Inter, the average improvements achieved by AARM are 39.401% for NDCG, 37.427% for recall, 32.823% for HT and 33.593% for precision. The results demonstrate the importance of modeling the interactions between different aspects and the effectiveness of our aspect-level attentive layer. We further perform a qualitative analysis of the aspect-level attention layer in Section 4.10.
4.7. Effect of Varied User Interest Modeling (RQ3)
In the design of AARM, we assume that a user’s interests towards aspects vary across different products. A user-level attentive pooling layer (Equations (6), (7) and (8)), which simultaneously considers user, product and aspect information, is designed to capture the user’s different biases towards aspects when facing different products. To verify the effect of the user-level attention module, we design two variants of AARM, called A_Static and No-UserAtt, and compare them with AARM on the top-N recommendation task under the same settings.
The corresponding precision and recall results of AARM and its variants on five datasets for RQ3.
The differences between AARM, A_Static and No-UserAtt lie in the design of the user-level attention module. A_Static also assumes that a user’s interests towards different aspects are different. But unlike AARM, A_Static assumes that a user’s interests towards aspects are fixed when facing different products. Therefore, the inputs of the user-level attention layer in A_Static do not consider the information of candidate products. When estimating a user’s interests towards its aspects, the input of each aspect is designed differently from AARM.
It is built on the overall representation of the aspects in the user’s aspect set: the input of an aspect to the user-level attention layer represents a summation of the similarities between this aspect and all the aspects in the user’s aspect set.
The attention layer is then defined similarly to Equation (7).
Its output represents the importance of the aspect with respect to the user. From Equations (20) and (21), we can see that no product information is used in the user-level attention module.
Different from AARM, No-UserAtt assumes that a user assigns equal weights to its aspects when purchasing products. So instead of using the user-level attentive pooling layer, No-UserAtt directly sums up the set of vectors that represent the candidate product’s performance on the aspects of the user.
As shown in Tables 6 and 7, AARM outperforms A_Static and No-UserAtt on all the datasets and on all the measures. Recall that the only difference between AARM and A_Static is the assumption about a user’s attention on aspects towards different products. From the results, we can see that AARM’s assumption of varied user interests is more reasonable than the constant user interests assumption of A_Static. In real-life scenarios, a user can be interested in many different kinds of products, and each product can be described by a specific set of aspects. The user will naturally pay less attention to the aspects that are not related to the current product. As no two products are exactly alike, a user’s interests in the diverse aspects can vary even for products from the same category. We will further show how the user-level attentive pooling works when facing different products in Section 4.10.
In Tables 6 and 7, A_Static also generally outperforms No-UserAtt on all the datasets. As A_Static can be viewed as an enhanced version of No-UserAtt in which a fixed user interests model is added, we can see that identifying the different importance of aspects can boost the recommendation performance. This result is reasonable because different users have different tastes, and they pay different amounts of attention to different product features.
4.8. Effects of Initialization and Tuning Strategy of Aspect Embedding (RQ4)
In AARM, the embeddings of aspects are first initialized with the vectors that are pre-trained with Word2vec on each dataset, and then transformed by a transformation matrix. This is inspired by the finding in (Mikolov et al., 2013) that word embeddings trained with Word2vec can retain the syntactic and semantic similarity relations between words. We keep the aspect embedding matrix fixed during the training of AARM for top-N recommendation, while the transformation matrix is tunable during the training. We choose this tuning strategy because it shifts similar words similarly, as shown in (Goldberg, 2017).
There are two other alternatives for the initialization and tuning strategy of the aspect embedding matrix. The first one is to randomly initialize the aspect embedding matrix and then tune it during the training for top-N recommendation. We conducted experiments under this setting and present the results in Tables 8 and 9 in the row “Random+Tune”. The second choice is to initialize the aspect embedding matrix with pre-trained embeddings and then tune it during the training for top-N recommendation. The experimental results of the second setting are presented in Tables 8 and 9 in the row “Pretrain+Tune”.
|Random vs. Pretrain||-3.296||-2.311||-0.519||-2.321||85.411||77.713||3.420||1.227||2.920||5.756|
|Random vs. Pretrain||-3.692||-3.025||-3.376||-7.187||83.952||76.852||2.125||2.252||4.577||3.582|
As shown in Tables 8 and 9, AARM with the “pretraining + trainable linear transformation” strategy outperforms Random+Tune and Pretrain+Tune on all the datasets and on all the measures. The results are reasonable because in the design of the attention layers in AARM, we assumed that the similarity between two aspects can be represented by the interaction between them. The capability of enabling similar words shifted similarly makes the “pretraining + trainable linear transformation” strategy more suitable for our task.
Comparing the performance of Random+Tune with Pretrain+Tune in Tables 8 and 9, we find that Pretrain+Tune outperforms Random+Tune on larger datasets like “Movies and TV” and “CDs and Vinyl” (refer to Table 1), while Random+Tune performs better on smaller datasets like “Clothing, Shoes and Jewelry”, “Cell Phones and Accessories” and “Beauty” (refer to Table 1). This may be because, when the training data is not sufficient, the Pretrain+Tune strategy may not be able to transform the pre-trained embeddings for the new task and thus loses the original similarity between words (Goldberg, 2017). The Random+Tune strategy, which assigns much smaller random initial values to the embedding matrix, is easier to optimize for the new task in an end-to-end style.
4.9. Model Ablation: Effect of Global Module and Aspect Module (RQ5)
In this section we examine the roles of the Global Interactions part and the Aspect Interactions part in the results of AARM. As shown in Figure 1, given the user and product as input, the two parts of AARM work separately. The outputs of these two parts are then merged and fed into the output layer to estimate the score. To verify the effect of the Aspect Interactions part, we remove the Global Interactions part from AARM and directly feed the result of the Aspect Interactions part into the output layer. This variant of AARM is referred to as “Aspect Part” in Tables 10 and 11. Similarly, another variant of AARM, referred to as “Global Part” in Tables 10 and 11, is constructed by removing the Aspect Interactions part from AARM to verify the effect of the Global Interactions part.
|Aspect vs. Global||-20.890||-17.341||6.996||8.007||54.705||58.664||-13.918||-8.267||-9.955||1.774|
|Aspect vs. Global||-21.695||-22.388||13.698||-1.799||62.320||52.881||-8.113||-11.285||-2.188||-17.120|
|Cell Phones and Accessories||26.34%||28.95%||19.73%||11.31%||6.09%||3.26%||4.33%|
|Clothing, Shoes and Jewelry||12.09%||24.90%||25.50%||17.92%||10.07%||5.01%||4.50%|
|Movies and TV||30.98%||27.26%||15.54%||8.73%||5.21%||3.30%||8.98%|
|CDs and Vinyl||3.06%||10.71%||13.91%||13.36%||11.39%||9.29%||38.27%|
From Tables 10 and 11, we can see that AARM significantly outperforms Aspect Part and Global Part. This result indicates that our concatenation-based combination strategy is valid. Moreover, the Global Interactions part, which is designed to capture the user preferences that are not mentioned in review texts, is an effective complement to the Aspect Interactions part.
Compared with Global Part, Aspect Part performs better on two datasets while it falls behind on the other three. Because Aspect Part connects users and products via the interactions between their aspects, its performance may be influenced by the number of interactions between related aspects. To verify this viewpoint, we traverse all the users and products in a dataset to construct all the possible user-product pairs, and then count the number of shared aspects of each user-product pair. A shared aspect of a user-product pair is an aspect that has been mentioned in both the user’s and the product’s reviews. The distributions of the number of shared aspects per user-product pair on the five datasets are shown in Table 12.
From Tables 10, 11 and 12, we can see that Aspect Part generally performs better on datasets that have more shared aspects per user-product pair. For example, Aspect Part substantially outperforms Global Part on the “CDs and Vinyl” and “Clothing, Shoes and Jewelry” datasets, which have the smallest ratios of user-product pairs with 0 shared aspects (see the 2nd column of Table 12). For the “Movies and TV”, “Cell Phones and Accessories” and “Beauty” datasets, where more than 20% of user-product pairs do not have any shared aspects, Global Part outperforms Aspect Part.
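The counting procedure behind such a distribution can be sketched as follows. Aspect sets are represented as Python sets keyed by user or product id; the function name `shared_aspect_distribution` and the bucketing of large counts into a single last bin (`cap`) are illustrative assumptions.

```python
from collections import Counter


def shared_aspect_distribution(user_aspects, product_aspects, cap=6):
    """Ratio of user-product pairs for each number of shared aspects.

    Counts of `cap` or more shared aspects are merged into the last bin.
    """
    counts = Counter()
    for u_set in user_aspects.values():
        for p_set in product_aspects.values():
            shared = len(u_set & p_set)  # aspects mentioned by both sides
            counts[min(shared, cap)] += 1
    total = sum(counts.values())
    return {k: counts[k] / total for k in range(cap + 1)}
```

Because every user is paired with every product, the cost grows with the product of the two set counts, so a real dataset would likely need sampling or a vectorized implementation.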
4.10. Case Study of Attention Layers
The user-level and aspect-level attention modules are important parts of AARM. The user-level attention module (refer to Equation 7) is employed to capture the user’s varied preferences on aspects. The aspect-level attention module (refer to Equation 4) is designed to enhance the interactions between meaningful aspect pairs, such as the interactions between the same or similar aspects, and to reduce the influence of the interactions between irrelevant aspects. To illustrate the roles of these two attention modules in AARM, we randomly selected some examples for qualitative analysis.
In Table 13, we show the user-level attention values of the user ‘A1P9UMP1XSE6MI’ in the “Cell Phones and Accessories” dataset when examining different products. The first column lists the ids of four products in the dataset and their aspect sets. Each product has 15 aspects, which is the 75% quantile of the sizes of all product aspect sets in the dataset. The remaining columns show the aspects of the user (the second row from the top) and the attention values that are assigned to these aspects when facing the aforementioned four products. From each product’s aspect set, we can see that product ‘B00EOE6FUW’ is a ‘usb charger’, ‘B005HS5MKS’ is a ‘bluetooth earpiece’, and ‘B002VPE1NO’ and ‘B00E8GYIRI’ are ‘shell cases’ for cell phones. The shared aspects of each user-product pair and the corresponding attention values are highlighted in red.
[Table 13: columns list the aspects of user A1P9UMP1XSE6MI; rows list the products and their aspects.]
As shown in Table 13, when examining a product, the user-level attention module can find the aspects that are related to the product and assign higher attention values to them. First, all the shared aspects (highlighted in red) of each user-product pair are assigned much higher attention values. Second, the user-level attention module can assign higher values to aspects that are related to the product but have not been mentioned in the product’s reviews. For example, when examining the shell cases ‘B002VPE1NO’ and ‘B00E8GYIRI’, ‘grab’ is assigned a higher weight although it is not in the product’s aspect set. This is because there are some aspects related to ‘grab’ in the two products’ aspect sets, which are captured by our attention module (refer to Figure 4).
The examples in Table 13 indicate why AARM can outperform A_Static and No-UserAtt (refer to Tables 6 and 7). The user’s aspect set consists of three unrelated kinds of aspects: 1) ‘sound quality’, ‘quality’ and ‘bluetooth earpiece’; 2) ‘usb cords’ and ‘usb plug’; 3) ‘shell case’, ‘grommets’, ‘impact protection’ and ‘grab’. In this case, No-UserAtt would assign the same weight to the aspects ‘bluetooth earpiece’ and ‘shell case’ when purchasing a bluetooth earpiece, and A_Static would assign the same weight to the aspect ‘sound quality’ no matter what kind of product the user is purchasing. By identifying the different roles of different aspects when purchasing different products, AARM achieves better performance.
Next we present how the aspect-level attention module finds the meaningful interactions (i.e., interactions between shared aspects, synonymous aspects and similar aspects) among all the aspect interactions between a user and a product. In Figure 4, we show the aspect-level attention values of the interactions between the aforementioned user ‘A1P9UMP1XSE6MI’ and product ‘B002VPE1NO’. In the heat map, the columns refer to the product’s aspects while the rows refer to the user’s aspects. The color of each grid cell represents the attention value assigned to the corresponding interaction. The darker the color of a grid cell, the higher the attention value.
First, we can see that interactions between the shared aspects like ‘grommets’, ‘impact protection’ and ‘shell case’ are captured and assigned higher attention values. Second, the interactions between synonymous aspects are assigned higher weights as compared with unrelated ones. For example, (‘shell case’, ‘shell’) is assigned the second highest attention value in the interactions between ‘shell case’ and the product’s aspects. Third, some interactions between similar aspects are captured. For example, in the interactions with ‘impact protection’, the product’s aspects ‘protection’, ‘armor’ and ‘grip’ are assigned high attention values. Finally, for the user’s aspects that are unrelated to the product (e.g. ‘usb plug’), their attention value distributions are more uniform compared to the shared and similar aspects. By assigning higher attention values to meaningful aspect interactions, AARM can alleviate the impact of noisy interactions and overcome the aspect sparsity problem.
5. Conclusion and Future Work
In this paper, we presented a Attentive Aspect-based Recommendation Model (AARM), which carefully capture the interactions between aspects extracted from reviews for recommendation. AARM first calculates the interactions between aspect embeddings to estimate how a product fits a user’s requirements on each aspect, and then estimates the user’s overall satisfactory on the product by synthesizing the product’s performances on each aspect. To deal with the problem that the number of shared aspects between a user and a product is often limited, AARM takes the interactions between different aspects into consideration. With a well-designed aspect-level attention module, not only the shared aspects but also other related aspect pairs can be selected and assigned higher attention values. In addition, we hold the assumption that a user’s interests towards aspects are varied when examining different products. To achieve the goal, an attention module which simultaneously considers user and product information is designed in AARM. In the experiments on five real-world datasets, AARM outperforms the state-of-the-art methods on the top-N recommendation task. In particular, compared with multi-modal (textual reviews, product images and numerical ratings) methods JRL and eJRL, AARM can still achieves better results in all datasets. To demonstrates the effectiveness of each component in AARM, a lot of quantitative experiments and qualitative case studies are conducted.
In the future, we would like to extend our work in the following three ways: (1) Applying our method for capturing the similarity relation between two different aspects to other recommendation scenarios. By using the pre-trained aspect embedding, the aspect embedding transformation module and the aspect interaction layer, AARM can mimic the cosine similarity and capture the semantic and syntactic similarities between two aspects. This strategy can also be used in other recommendation scenarios (e.g., recommendation with tags or item metadata) to capture the relation between different elements (such as tags or item categories). (2) Extracting aspects with a neural network and combining it with AARM. In particular, we would like to jointly train the aspect extraction module and the recommendation module in an end-to-end manner. Ideally, end-to-end training could reduce noisy aspects and mine more domain-specific aspects. (3) Integrating aspect-level sentiment information into AARM. Aspect-level sentiment information is useful for identifying a user’s likes and dislikes about product features. However, existing methods usually use external tools for aspect-level sentiment analysis, so they depend on the accuracy of these tools and usually cannot deal with new reviews. We will study how to extract this sentiment information and integrate it into AARM with end-to-end learning.
- Bauman et al. (2017) Konstantin Bauman, Bing Liu, and Alexander Tuzhilin. 2017. Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 717–725.
- Cao et al. (2018) Da Cao, Xiangnan He, Lianhai Miao, Yahui An, Chao Yang, and Richang Hong. 2018. Attentive group recommendation. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 645–654.
- Cao et al. (2017) Da Cao, Xiangnan He, Liqiang Nie, Xiaochi Wei, Xia Hu, Shunxiang Wu, and Tat-Seng Chua. 2017. Cross-platform app recommendation by jointly modeling ratings and texts. ACM Transactions on Information Systems 35, 4, Article 37 (2017), 37:1–37:27 pages.
- Catherine and Cohen (2017) Rose Catherine and William Cohen. 2017. Transnets: Learning to transform for recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 288–296.
- Chen et al. (2018) Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural Attentional Rating Regression with Review-level Explanations. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1583–1592.
- Chen et al. (2019) Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2019. Social Attentional Memory Network: Modeling Aspect- and Friend-level Differences in Recommendation. In The eleventh ACM International Conference on Web Search and Data Mining.
- Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 335–344.
- Chen et al. (2015) Li Chen, Guanliang Chen, and Feng Wang. 2015. Recommender systems based on user reviews: the state of the art. User Modeling and User-Adapted Interaction 25, 2 (2015), 99–154.
- Chen et al. (2016) Xu Chen, Zheng Qin, Yongfeng Zhang, and Tao Xu. 2016. Learning to rank features for recommendation over multiple categories. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 305–314.
- Cheng et al. (2019) Zhiyong Cheng, Xiaojun Chang, Lei Zhu, Rose C Kanjirathinkal, and Mohan Kankanhalli. 2019. MMALFM: Explainable recommendation by leveraging reviews and images. ACM Transactions on Information Systems (TOIS) 37, 2 (2019), 16.
- Cheng et al. (2018a) Zhiyong Cheng, Ying Ding, Xiangnan He, Lei Zhu, Xuemeng Song, and Mohan Kankanhalli. 2018a. A3NCF: An adaptive aspect attention model for rating prediction. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 3748–3754.
- Cheng et al. (2018b) Zhiyong Cheng, Ying Ding, Lei Zhu, and Mohan Kankanhalli. 2018b. Aspect-aware latent factor model: Rating prediction with ratings and reviews. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 639–648.
- Chin et al. (2018) Jin Yao Chin, Kaiqi Zhao, Shafiq Joty, and Gao Cong. 2018. ANR: Aspect-based Neural Recommender. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 147–156.
- Cremonesi et al. (2010) Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 39–46.
- Dong and Smyth (2017) Ruihai Dong and Barry Smyth. 2017. User-based opinion-based recommendation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 4821–4825.
- Ebesu et al. (2018) Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative Memory Network for Recommendation Systems. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’18). ACM, New York, NY, USA, 515–524.
- Ganu et al. (2013) Gayatree Ganu, Yogesh Kakodkar, and Amélie Marian. 2013. Improving the quality of predictions using textual information in online user reviews. Information Systems 38, 1 (2013), 1–15.
- Goldberg (2017) Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies 10, 1 (2017), 1–309.
- Guo et al. (2018) Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Xin-Shun Xu, and Mohan Kankanhalli. 2018. Multi-modal preference modeling for product search. In Proceedings of the 2018 ACM on Multimedia Conference. ACM.
- He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 507–517.
- He et al. (2015) Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 1661–1670.
- He and Chua (2017) Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 355–364.
- He et al. (2018) Xiangnan He, Zhenkui He, Jingkuan Song, Zhenguang Liu, Yu-Gang Jiang, and Tat-Seng Chua. 2018. NAIS: Neural attentive item similarity model for recommendation. IEEE Transactions on Knowledge and Data Engineering (2018).
- He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
- Hong et al. (2017) Richang Hong, Lei Li, Junjie Cai, Dapeng Tao, Meng Wang, and Qi Tian. 2017. Coherent semantic-visual indexing for large-scale image retrieval in the cloud. IEEE Transactions on Image Processing 26, 9 (2017), 4128–4138.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Koren et al. (2009) Y. Koren, R. Bell, and C. Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
- Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. 1188–1196.
- Liu (2009) Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3, 3 (March 2009), 225–331.
- Luo et al. (2018) Xin Luo, Ye Wu, and Xin-Shun Xu. 2018. Scalable supervised discrete hashing for large-scale search. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1603–1612.
- McAuley and Leskovec (2013) Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems. ACM, 165–172.
- Meng et al. (2018) Xuying Meng, Suhang Wang, Huan Liu, and Yujun Zhang. 2018. Exploiting Emotion on Reviews for Recommender Systems. In AAAI Conference on Artificial Intelligence.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
- Pan et al. (2019) Weike Pan, Qiang Yang, Wanling Cai, Yaofeng Chen, Qing Zhang, Xiaogang Peng, and Zhong Ming. 2019. Transfer to Rank for Heterogeneous One-Class Collaborative Filtering. ACM Trans. Inf. Syst. 37, 1, Article 10 (Jan. 2019), 20 pages.
- Pero and Horváth (2013) Štefan Pero and Tomáš Horváth. 2013. Opinion-driven matrix factorization for rating prediction. In International Conference on User Modeling, Adaptation, and Personalization. Springer, 1–13.
- Rendle (2010) Steffen Rendle. 2010. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining. IEEE Computer Society, 995–1000.
- Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 452–461.
- Salton et al. (1975) Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613–620.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
- Tan et al. (2018) Jiwei Tan, Xiaojun Wan, Hui Liu, and Jianguo Xiao. 2018. QuoteRec: Toward quote recommendation for writing. ACM Transactions on Information Systems 36, 3 (2018), 34:1–34:36.
- Tan et al. (2016) Yunzhi Tan, Min Zhang, Yiqun Liu, and Shaoping Ma. 2016. Rating-boosted latent topics: Understanding users and items with ratings and reviews. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. AAAI Press, 2640–2646.
- Wang et al. (2016) Shuaiqiang Wang, Shanshan Huang, Tie-Yan Liu, Jun Ma, Zhumin Chen, and Jari Veijalainen. 2016. Ranking-oriented collaborative filtering: A listwise approach. ACM Transactions on Information Systems 35, 2 (2016), 10:1–10:28.
- Wang et al. (2018) Xiang Wang, Xiangnan He, Fuli Feng, Liqiang Nie, and Tat-Seng Chua. 2018. TEM: Tree-enhanced Embedding Model for Explainable Recommendation. In Proceedings of the 2018 World Wide Web Conference (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1543–1552.
- Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 3119–3125.
- Xu et al. (2014) Yinqing Xu, Wai Lam, and Tianyi Lin. 2014. Collaborative filtering incorporating review text and co-clusters of hidden user communities and item groups. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 251–260.
- Yang et al. (2017) Longqi Yang, Cheng-Kang Hsieh, Hongjian Yang, John P. Pollak, Nicola Dell, Serge Belongie, Curtis Cole, and Deborah Estrin. 2017. Yum-Me: A personalized nutrient-based meal recommender system. ACM Transactions on Information Systems 36, 1 (2017), 7:1–7:31.
- Zhang and Wang (2016) Wei Zhang and Jianyong Wang. 2016. Integrating topic and latent factors for scalable personalized review-based rating prediction. IEEE Transactions on Knowledge and Data Engineering 28, 11 (2016), 3013–3027.
- Zhang et al. (2017) Yongfeng Zhang, Qingyao Ai, Xu Chen, and W Bruce Croft. 2017. Joint representation learning for top-n recommendation with heterogeneous information sources. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1449–1458.
- Zhang et al. (2014a) Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014a. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th international ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 83–92.
- Zhang et al. (2014b) Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu, and Shaoping Ma. 2014b. Do users rate or review?: Boost phrase-level sentiment labeling with review-level sentiment classification. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1027–1030.
- Zheng et al. (2017) Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 425–434.