Since the inception of the Netflix Prize competition, latent factor collaborative filtering (CF) has been widely adopted across recommendation tasks owing to its strong performance over other methods. It essentially employs a latent factor model such as matrix factorization and/or neural networks to learn user or item feature representations for rendering recommendations. Despite much success, latent factor CF approaches often suffer from a lack of interpretability. In a contemporary recommender system, explaining why a user likes an item can be as important as the accuracy of the rating prediction itself.
Explainable recommendation can improve the transparency, persuasiveness and trustworthiness of the system. To provide intuitive explanations for recommendations, recent efforts have focused on using metadata such as user-defined tags and topics from user review texts or item descriptions [17, 4] to illuminate users' preferences. Other works such as [14, 12, 24] use aspects to explain recommendations. Although these approaches can explain recommendations using external metadata, the interpretability of the models themselves and the interpretable features enabling the explainable recommendations have not been systematically studied and are thus poorly understood. It is also worth mentioning that the challenges in explainable recommendation lie not only in the modeling itself, but also in the lack of a gold standard for evaluating explainability.
Here we propose a novel feature mapping strategy that not only enjoys the advantages of strong performance in latent factor models but also is capable of providing explainability via interpretable features. The main idea is that by mapping the general features learned using a base latent factor model onto interpretable aspect features, one could explain the outputs using the aspect features without compromising the recommendation performance of the base latent factor model. We also propose two new metrics for evaluating the quality of explanations in terms of a user’s general preference over all items and the aspect preference to a specific item. Simply put, we formulate the problem as: 1) how to find the interpretable aspect basis; 2) how to perform interpretable feature mapping; and 3) how to evaluate explanations.
We summarize our main contributions as follows: 1) We propose a novel feature mapping approach to map the general uninterpretable features to interpretable aspect features, enabling explainability of the traditional latent factor models without metadata; 2) Borrowing strength across aspects, our approach is capable of alleviating the trade-off between recommendation performance and explainability; and 3) We propose new schemes for evaluating the quality of explanations in terms of both general user preference and specific user preference.
2 Related Work
There are varieties of strategies for rendering explainable recommendations. We first review methods that give explanations in light of aspects, which are closely related to our work. We then discuss other recent explainable recommendation works using metadata and knowledge in lieu of aspects.
2.1 Aspect Based Explainable Recommendation
Aspects can be viewed as explicit features of an item that could provide useful information in recommender systems. An array of approaches have been developed to render explainable recommendations at the aspect level using metadata such as user reviews. These approaches mostly fall into three categories: 1) Graph-based approaches: they incorporate aspects as additional nodes in the user-item bipartite graph. For example, TriRank [12] extracts aspects from user reviews and forms a user-item-aspect tripartite graph with smoothness constraints, achieving a review-aware top-N recommendation. ReEL [2] calculates user-aspect bipartite graphs from location-aspect bipartite graphs in order to infer user preferences. 2) Approaches with aspects as regularizations or priors: they use the extracted aspects as additional regularizations for the factorization models. For example, AMF [14] constructs an additional user-aspect matrix and an item-aspect matrix from review texts as regularizations for the original matrix factorization models. JMARS [9] generalizes probabilistic matrix factorization by incorporating user-aspect and movie-aspect priors, enhancing recommendation quality by jointly modeling aspects, ratings and sentiments from review texts. 3) Approaches with aspects as explicit factors: other than regularizing the factorization models, aspects can also be used as factors themselves. The explicit factor model (EMF) [24] factorizes a rating matrix in terms of both predefined explicit features (i.e. aspects) and implicit features, rendering aspect-based explanations. Similarly, [5] extends EMF by applying tensor factorization on a more complex user-item-feature tensor.
2.2 Beyond Aspect Explanation
There are also other approaches that do not utilize aspects to explain recommendations. For example, [16] gives explanations in light of movie similarities defined using movie characters and their interactions, and [21] proposes explainable recommendations by exploiting knowledge graphs, where paths are used to infer the underlying rationale of user-item interactions. With the increasing availability of textual data from users and merchants, more approaches have been developed for explainable recommendation using metadata. For example, [7, 8, 17] attempt to generate textual explanations directly, whereas [22, 6] give explanations by highlighting the most important words/phrases in the original reviews.
Overall, most of the approaches discussed in this section rely on metadata and/or external knowledge to give explanations without interpreting the model itself. In contrast, our Attentive Multitask Collaborative Filtering (AMCF) approach maps uninterpretable general features to interpretable aspect features using an existing aspect definition. As such, it not only gives explanations for users but also learns interpretable features for the modelers. Moreover, any latent factor model can be adopted as the base model to derive the general features for the proposed feature mapping approach.
3 The Proposed AMCF Model
In this section, we first introduce the problem formulation and the underlying assumptions. We then present our AMCF approach for explainable recommendations. AMCF incorporates aspect information and maps the latent features of items to the aspect feature space using an attention mechanism. With this mapping, we can explain recommendations of AMCF from the aspect perspective. An aspect $s$ [3] is an attribute that characterizes an item. Assuming there are $m$ aspects under consideration in total, if an item has aspects $s_{i_1}, \dots, s_{i_k}$ simultaneously, the item can then be described by $I = \{s_{i_1}, \dots, s_{i_k}\}$, $k \le m$. We say that an item has aspect $s_t$ if $s_t \in I$.
3.1 Problem Formulation
Inputs: The inputs consist of three parts: the set of users $U$, the set of items $V$, and the set of corresponding multi-hot aspect vectors for items, denoted by $S$.
Outputs: Given the user-item-aspect triplet, e.g. user $i$, item $j$, and aspect multi-hot vector $s_j$ for item $j$, our model predicts not only the review rating, but also the user's general preference over all items and the user's specific preference on item $j$ in terms of aspects, i.e., which aspects of item $j$ user $i$ is most interested in.
The trade-off between model interpretability and performance states that we can either achieve high interpretability with simpler models or high performance with more complex models that are generally harder to interpret. Recent works [23, 12, 24] have shown that with adequate metadata and knowledge, it is possible to achieve both explainability and high accuracy in the same model. However, those approaches mainly focus on explaining the recommendation rather than exploiting the interpretability of the models and features, and hence are still not interpretable from a modeling perspective. Explainability and interpretability refer to “why” and “how” a recommendation is made, respectively. Many of the above-referenced works only answer the “why” question via constraints from external knowledge without addressing “how”. In contrast, our proposed AMCF model answers both the “why” and “how” questions: our recommendations are made based on the attention weights (why), and the weights are learned by interpretable feature decomposition (how). To achieve this, we assume that an interpretable aspect feature representation can be mathematically derived from the corresponding general feature representation. More formally:
Assumption 1. Assume there are two representations for the same prediction task: $u$ in a complex feature space $\mathcal{U}$ (i.e. the general embedding space including item embeddings and aspect embeddings), and $v$ in a simpler feature space $\mathcal{V}$ (i.e. the space spanned by the aspect embeddings), with $\mathcal{V} \subset \mathcal{U}$. We say that $v$ is the projection of $u$ from space $\mathcal{U}$ to space $\mathcal{V}$, and there exists a mapping $M(\cdot, \theta)$ such that $v = M(u, \theta)$, with $\theta$ as a hyper-parameter.
This assumption is based on the widely accepted notion that a simple local approximation can give a good interpretation of a complex model in that particular neighborhood [20]. Instead of selecting surrogate interpretable simple models (such as linear models), we map the general complex features to the simpler interpretable aspect features, and then render recommendations based on the general complex features. We give explanations using the interpretable aspect features, achieving the best of both worlds: we keep the high performance of the complex model and gain the interpretability of the simpler model. In this work, the interpretable simple features are obtained based on aspects, hence we call the corresponding feature space the aspect space. To map the complex general features onto the interpretable aspect space, we define the aspect projection.
Definition 1 (Aspect projection). Under Assumption 1, the interpretable aspect feature obtained by projecting a general feature from the general feature space onto the aspect feature space is called the aspect projection of that general feature.
To achieve good interpretability and performance in the same model, from Definition 1 and Assumption 1, we need to find the mapping $M(\cdot, \theta)$. Here we first use a latent factor model as the base model for explicit rating prediction, which learns the general features, as shown in Figure 2 (left), where we call the item embedding $u$ the general complex feature learned by the base model. The remaining problem is then to derive the mapping from the non-interpretable general features to the interpretable aspect features.
3.3 Aspect Embedding
To design a simple interpretable model, its features should be well aligned with our interest; aspects are a reasonable choice. Taking movie genre as an example: if we use four genres (Romance, Comedy, Thriller, Fantasy) as aspects, the movie Titanic's aspect vector should be $(1, 0, 0, 0)$ because its genre is romance, and the movie Cinderella's aspect vector is $(1, 0, 0, 1)$ because its genre falls into both romance and fantasy.
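The multi-hot encoding in the genre example can be sketched as follows. This is only an illustration: the helper name `multi_hot` and the constant `GENRES` are hypothetical, and the genre order simply follows the order in which the four genres are listed above.

```python
# Genre vocabulary in the order used in the text's example.
GENRES = ["Romance", "Comedy", "Thriller", "Fantasy"]

def multi_hot(item_genres, vocabulary=GENRES):
    """Return a 0/1 vector with a 1 at the position of each genre the item has."""
    present = set(item_genres)
    return [1 if genre in present else 0 for genre in vocabulary]

titanic = multi_hot(["Romance"])                # a single genre
cinderella = multi_hot(["Romance", "Fantasy"])  # two genres at once
no_genre = multi_hot([])                        # no listed genre -> all zeros
```

An item with no known genre maps to the all-zero vector under this encoding.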
From Assumption 1 and Definition 1, to make the feature mapping from a general feature $u$ to an aspect feature $v$, we need to first define the aspect space $\mathcal{V}$. Assuming there are $m$ aspects under consideration, we represent the $m$ aspects by $m$ latent vectors in the general space $\mathcal{U}$, and use these $m$ aspect vectors as the basis that spans the aspect space $\mathcal{V} \subset \mathcal{U}$. These aspects' latent vectors can be learned by neural embedding or other feature learning methods, with each aspect corresponding to an individual latent feature vector. Our model uses an embedding approach to extract $m$ aspect latent vectors of dimension $n$, where $n$ is the dimension of space $\mathcal{U}$. In Figure 2, the vertical columns in red represent the $m$ aspect embeddings in the general space $\mathcal{U}$, which are obtained by embedding the aspect multi-hot vectors from the input.
3.4 Aspect Projection of Item Embedding
In Assumption 1, $u$ is the general feature representation (i.e. the item embedding) in space $\mathcal{U}$, and $v$ is the interpretable aspect feature representation in space $\mathcal{V}$. The orthogonal projection from the general space $\mathcal{U}$ to the aspect space $\mathcal{V}$ is denoted by $M$, i.e. $v = M(u)$.
From the perspective of learning disentangled representations, the item embedding $u$ can be disentangled as $u = v + b$ (Figure 1), where $v$ encodes the aspect information of an item and $b$ is the item-unique information. For example, movies from the same genre share a similar artistic style ($v$), yet each movie has its own unique characteristics ($b$). With this disentanglement of item embeddings, we can explain recommendations by capturing a user's preference in terms of aspects.
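Let us first assume that we have $m$ mutually orthogonal and normalized aspect vectors $v_1, \dots, v_m$ in space $\mathcal{U}$, which span the subspace $\mathcal{V}$. For any vector $v = M(u)$ in space $\mathcal{V}$, there exists a unique decomposition such that $v = \sum_{i=1}^{m} a_i v_i$. The coefficients can be directly calculated by $a_i = \langle v, v_i \rangle = \langle u, v_i \rangle$ ($i = 1, \dots, m$; $v_i$ is normalized). Note that the second equality comes from the fact that $v$ is the orthogonal projection of $u$ onto space $\mathcal{V}$.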
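Generally speaking, however, $v_1, \dots, v_m$ are not orthogonal. In this case, as long as they are linearly independent, we can perform the Gram-Schmidt orthogonalization process to obtain the corresponding orthogonal basis. The procedure can be simply described as follows: $\tilde{v}_1 = v_1$, and $\tilde{v}_i = v_i - \sum_{j=1}^{i-1} \frac{\langle v_i, \tilde{v}_j \rangle}{\langle \tilde{v}_j, \tilde{v}_j \rangle} \tilde{v}_j$ for $i = 2, \dots, m$, where $\langle \cdot, \cdot \rangle$ denotes the inner product. We can then calculate the unique decomposition as in the orthogonal case. Assuming the resulting decomposition is $v = \sum_{i=1}^{m} \tilde{a}_i \tilde{v}_i$, the coefficients corresponding to the original basis can then be calculated by back-substitution, starting from $a_m = \tilde{a}_m$:
$$a_i = \tilde{a}_i - \sum_{j=i+1}^{m} a_j \frac{\langle v_j, \tilde{v}_i \rangle}{\langle \tilde{v}_i, \tilde{v}_i \rangle}, \quad i = m-1, \dots, 1.$$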
Hence, after the aspect feature projection and decomposition, regardless of whether the aspect vectors are orthogonal or not, we have the following unique decomposition in space $\mathcal{V}$: $v = \sum_{i=1}^{m} a_i v_i$.
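The projection and decomposition described above can be checked numerically. The following is a minimal sketch in plain Python, assuming a toy three-dimensional general space and two non-orthogonal aspect vectors; the function names (`dot`, `gram_schmidt`, `aspect_coefficients`) and the numbers are hypothetical and are not taken from the paper.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(basis):
    """Orthogonalize linearly independent vectors (without normalizing them)."""
    ortho = []
    for v in basis:
        w = list(v)
        for o in ortho:
            c = dot(v, o) / dot(o, o)
            w = [wi - c * oi for wi, oi in zip(w, o)]
        ortho.append(w)
    return ortho

def aspect_coefficients(u, basis):
    """Coefficients a_i such that the projection of u equals sum_i a_i * basis[i]."""
    ortho = gram_schmidt(basis)
    # Coefficients with respect to the orthogonal basis.
    a_tilde = [dot(u, o) / dot(o, o) for o in ortho]
    # Back-substitution to the original, non-orthogonal basis.
    m = len(basis)
    a = [0.0] * m
    for i in range(m - 1, -1, -1):
        correction = sum(
            a[j] * dot(basis[j], ortho[i]) / dot(ortho[i], ortho[i])
            for j in range(i + 1, m)
        )
        a[i] = a_tilde[i] - correction
    return a

basis = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy aspect vectors, not orthogonal
u = [3.0, 2.0, 5.0]                          # toy general (item) feature
a = aspect_coefficients(u, basis)
v = [sum(a[i] * basis[i][k] for i in range(len(basis))) for k in range(len(u))]
b = [uk - vk for uk, vk in zip(u, v)]        # item-unique residual, u = v + b
```

In this toy case the residual `b` has a zero inner product with every aspect vector, which is the defining property of the orthogonal projection.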
Aspect Projection via Attention: As described above, any interpretable aspect feature $v$ can be uniquely decomposed as $v = \sum_{i=1}^{m} a_i v_i$, which is similar to the form of an attention mechanism. Therefore, instead of using the Gram-Schmidt orthogonalization process, we utilize an attention mechanism to reconstruct $v$ directly. Assume we can obtain an attention vector $a = (a_1, \dots, a_m)$, which can be used to calculate $\hat{v} = \sum_{i=1}^{m} a_i v_i$. Since the decomposition is unique, our goal is then to minimize the distance $\|\hat{v} - v\|_2$ to ensure that $\hat{v} \approx v$.
However, as the interpretable aspect feature $v$ is not available, we cannot minimize $\|\hat{v} - v\|_2$ directly. Fortunately, the general feature $u$ is available (obtained from a base latent factor model), and since $v$ is the projection of $u$, i.e. $v = M(u)$, we have the following lemma:
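Lemma 1. Provided that $v$ is the projection of $u$ from space $\mathcal{U}$ to space $\mathcal{V}$, where $\mathcal{V} \subset \mathcal{U}$, we have
$$\arg\min_{\theta} \|\hat{v}_{\theta} - v\|_2 = \arg\min_{\theta} \|\hat{v}_{\theta} - u\|_2,$$
where $\hat{v}_{\theta} = \sum_{i=1}^{m} a_i v_i$ is the attention-based reconstruction with parameters $\theta$, $v = M(u)$, and $\|\cdot\|_2$ denotes the $\ell_2$ norm.
Proof. Referring to the illustration in Figure 1, denote the difference between $u$ and $v$ as $b$, i.e. $u = v + b$. Hence
$$\arg\min_{\theta} \|\hat{v}_{\theta} - u\|_2 = \arg\min_{\theta} \|\hat{v}_{\theta} - v - b\|_2.$$
Note that $b$ is perpendicular to both $\hat{v}_{\theta}$ and $v$, so the right-hand side can be written as
$$\arg\min_{\theta} \sqrt{\|\hat{v}_{\theta} - v\|_2^2 + \|b\|_2^2}.$$
As $b$ is not parameterized by $\theta$, we then get
$$\arg\min_{\theta} \|\hat{v}_{\theta} - u\|_2 = \arg\min_{\theta} \|\hat{v}_{\theta} - v\|_2.$$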
From the above proof, we know that the attention mechanism is sufficient to reconstruct $v$ by minimizing $\|\hat{v} - u\|_2$. Note that from the perspective of the disentanglement $u = v + b$, the information in $b$, i.e., the item-specific characteristics, is not explained in our model. Intuitively, the item-specific characteristics are learned from the metadata associated with the item.
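The lemma can be illustrated numerically. The sketch below is an assumption-laden stand-in: it replaces the attention network with freely optimized weights fitted by plain gradient descent, and uses hypothetical toy vectors. It fits the weights once against the general feature and once against its projection, and the two runs reach the same weights.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def combine(weights, basis):
    """Weighted sum of the basis vectors, i.e. the reconstruction v_hat."""
    dim = len(basis[0])
    return [sum(w * vec[k] for w, vec in zip(weights, basis)) for k in range(dim)]

def fit_weights(target, basis, steps=2000, lr=0.1):
    """Minimize ||sum_i a_i v_i - target||^2 over the weights a by gradient descent."""
    a = [0.0] * len(basis)
    for _ in range(steps):
        residual = [h - t for h, t in zip(combine(a, basis), target)]
        grad = [2.0 * dot(residual, vec) for vec in basis]
        a = [ai - lr * gi for ai, gi in zip(a, grad)]
    return a

basis = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy aspect vectors
u = [3.0, 2.0, 5.0]                          # general feature (available)
v = [3.0, 2.0, 0.0]                          # its projection onto span(basis)

a_from_u = fit_weights(u, basis)  # what can be optimized in practice
a_from_v = fit_weights(v, basis)  # the ideal but unavailable objective
```

Both runs converge to the same weights because the residual between `u` and `v` is orthogonal to every aspect vector and therefore contributes nothing to the gradient.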
3.5 The Loss Function
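The loss function for finding the feature mapping that achieves both interpretability and performance of the recommender model has two components:
- $\mathcal{L}_{pred}$: the prediction loss in rating predictions, corresponding to the loss function of the base latent factor model.
- $\mathcal{L}_{int}$: the interpretation loss with respect to the general feature $u$. This loss quantifies $\|\hat{v} - u\|_2$.
We calculate the rating prediction loss component using RMSE: $\mathcal{L}_{pred} = \sqrt{\frac{1}{N} \sum_{(i,j)} (r_{ij} - \hat{r}_{ij})^2}$, where $\hat{r}_{ij}$ represents the predicted item ratings and $N$ is the number of ratings. We then calculate the interpretation loss component as the average distance between $\hat{v}$ and $u$: $\mathcal{L}_{int} = \frac{1}{N} \sum_{(i,j)} \|\hat{v}_j - u_j\|_2$. The loss component $\mathcal{L}_{int}$ encourages the interpretable feature $\hat{v}$ obtained from the attentive neural network to be a good approximation of the aspect feature representation $v$ (Lemma 1). Hence the overall loss function is $\mathcal{L} = \mathcal{L}_{pred} + \lambda \mathcal{L}_{int}$, where $\lambda$ is a tuning parameter that balances the importance of the two loss components.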
Gradient Shielding Trick: To ensure that interpretation does not compromise the prediction accuracy, we allow forward propagation to both the prediction loss and the interpretation loss, but block back-propagation from the interpretation loss to the item embedding. In other words, when learning the model parameters based on back-propagated gradients, the item embedding is updated only via the gradients from the prediction loss.
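As a rough illustration of how the two loss components combine, the sketch below evaluates the overall loss on toy values. It assumes the RMSE form for the prediction loss and the mean Euclidean distance for the interpretation loss, as described in the text; the function names and numbers are hypothetical, and the weight `lam` is chosen arbitrarily here.

```python
import math

def rmse(ratings, predictions):
    """Root mean squared error between true and predicted ratings."""
    n = len(ratings)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(ratings, predictions)) / n)

def interpretation_loss(item_embeddings, reconstructions):
    """Mean Euclidean distance between each item embedding and its reconstruction."""
    distances = [math.dist(u, v_hat) for u, v_hat in zip(item_embeddings, reconstructions)]
    return sum(distances) / len(distances)

def total_loss(ratings, predictions, item_embeddings, reconstructions, lam):
    """Prediction loss plus lam times the interpretation loss."""
    return rmse(ratings, predictions) + lam * interpretation_loss(
        item_embeddings, reconstructions
    )

loss = total_loss(
    ratings=[4.0, 2.0],
    predictions=[3.0, 2.0],
    item_embeddings=[[3.0, 2.0, 5.0]],
    reconstructions=[[3.0, 2.0, 0.0]],
    lam=0.1,  # arbitrary illustrative value
)
```

This sketch only computes loss values. In an automatic differentiation framework, the gradient shielding described above would typically be expressed by treating the item embedding as a constant inside the interpretation term (a stop-gradient), so that only the prediction term updates it.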
3.6 User Preference Prediction
Thus far we have attempted to optimize the ability to predict user preference via aspect feature mapping. We call a user's overall preference the general preference, and the user's preference on a specific item the specific preference.
General preference: Figure 3 illustrates how to predict a user's general preference. Here we define a virtual item, which is a linear combination of aspect embeddings. For general preference, we let the virtual item be a single aspect embedding $v_t$ to simulate a pure aspect-$t$ movie; the resulting debiased rating prediction (with all bias terms discarded) indicates the user's preference on that specific aspect (e.g., positive for ‘like’, negative for ‘dislike’). Formally: $p_i^t = f(q_i, v_t)$, where $q_i$ is the embedding of user $i$, $v_t$ is aspect $t$'s embedding, and $f$ is the corresponding base latent factor model without bias terms.
Specific preference: Figure 3 also shows our model's ability to predict user preference on a specific item, as long as we can find how to represent the item in terms of aspect embeddings. Fortunately, the attention mechanism helps us find the constitution of any item in terms of aspect embeddings using the attention weights. That is, for any item $j$, it is possible to rewrite the latent representation as a linear combination of aspect embeddings: $u_j = \sum_{t=1}^{m} a_{j,t} v_t + b_j$, where $a_{j,t}$ and $v_t$ are the $t$-th attention weight and the $t$-th aspect feature, respectively. The term $b_j$ reflects the interpretation loss. For aspect $t$ of item $j$, we use $a_{j,t} v_t$ to represent the embedding of a virtual item which represents the aspect-$t$ property of the specific item. Hence, the output $p_{ij}^t = f(q_i, a_{j,t} v_t)$ indicates the specific preference of user $i$ on aspect $t$ of item $j$.
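The two preference outputs can be sketched with a plain dot product standing in for the debiased base model, which corresponds to an SVD-style base model without bias terms. This choice of base model, the function names, and the toy embeddings are all assumptions made for illustration.

```python
def dot(x, y):
    """Inner product, used here as a stand-in for the debiased base model."""
    return sum(a * b for a, b in zip(x, y))

def general_preference(user_embedding, aspect_embeddings):
    """Score each pure-aspect virtual item for the user."""
    return [dot(user_embedding, v_t) for v_t in aspect_embeddings]

def specific_preference(user_embedding, aspect_embeddings, attention):
    """Score each attention-scaled aspect embedding of one particular item."""
    return [
        dot(user_embedding, [a_t * x for x in v_t])
        for a_t, v_t in zip(attention, aspect_embeddings)
    ]

user = [1.0, -1.0]                               # toy user embedding
aspects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy aspect embeddings
attention = [0.5, 0.5, 0.0]                      # toy attention weights of one item

general = general_preference(user, aspects)
specific = specific_preference(user, aspects, attention)
```

A positive entry suggests the user likes the corresponding aspect and a negative entry suggests dislike; an aspect with zero attention weight contributes nothing to the item-specific explanation.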
Model Interpretability: From specific preference, a learned latent general feature can be decomposed into the linear combination of interpretable aspect features, which would help interpret models in a more explicit and systematic manner.
4 Experiments and Discussion
We design and perform experiments to demonstrate two advantages of our AMCF approach: 1) comparable rating predictions; 2) good explanations of why a user likes/dislikes an item. To demonstrate the first advantage, we compare the rating prediction performance with baseline approaches that only predict ratings. Demonstrating the second advantage, however, is not a trivial task, since there is currently no gold standard for evaluating explanations of recommendations other than using real customer feedback [7, 10]. Hence it is necessary to develop new schemes to evaluate the quality of explainability for both general and specific user preferences.
MovieLens Datasets: This dataset [11] offers very complete movie genre information, which provides a solid foundation for genre (aspect) preference prediction, i.e. determining which genre a user likes most. We consider the movie genres as aspects.
Yahoo Movies Dataset This data set from Yahoo Lab contains usual user-movie ratings as well as metadata such as movie’s title, release date, genre, directors and actors. We use the movie genres as the aspects for movie recommendation and explanation. Summary statistics are shown in Table 1.
Pre-processing: We use multi-hot encoding to represent the genres of each movie or book, where $1$ indicates that the movie is of that genre and $0$ otherwise. However, there are still plenty of movies with missing genre information. In such cases, we simply treat them as belonging to none of the listed genres, i.e., all zeros in the aspect multi-hot vector: $(0, 0, \dots, 0)$.
|Dataset||# of ratings||# of items||# of users||# of genres|
4.2 Results of Prediction Accuracy
We select several strong baseline models to compare rating prediction accuracy, including non-interpretable models such as SVD [15], Neural Collaborative Filtering (NCF) [13] and Factorization Machine (FM) [19], and an interpretable linear regression model (LR). Here the LR model is implemented by using aspects as inputs and learning separate parameter sets for different individual users. In comparison, our AMCF approaches also include SVD, NCF or FM as the base model to demonstrate that the interpretation module does not compromise the prediction accuracy. Note that since regular NCF and FM are designed for implicit ratings (1 and 0), we replace their last sigmoid output layers with fully connected layers in order to output explicit ratings.
In terms of robustness, we set the dimension of latent factors in the base models to , and . The regularization tuning parameter is set to , which demonstrated better performance compared to other selections. It is worth noting that the tuning parameters of the base model of our AMCF approach are directly inherited from the corresponding non-interpretable model. We compare our AMCF models with baseline models as shown in Table 2. It is clear that AMCF achieves comparable prediction accuracy to their non-interpretable counterparts, and significantly outperforms the interpretable LR model.
4.3 Evaluation of Explainability
Although recent efforts have been made to evaluate the quality of explanation by defining explainability precision (EP) and explainability recall (ER) [18, 1], the scarcity of ground truth such as a user's true preference remains a significant obstacle for explainable recommendation. [10] makes an initial effort in collecting ground truth by surveying real customers; however, the intensive labor, time consumption and sampling bias involved may prevent its large-scale application in a variety of contexts. Other text-based approaches [8, 17] can also use natural language processing (NLP) metrics such as the Automated Readability Index (ARI) and Flesch Reading Ease (FRE). As we do not use metadata such as text reviews in our AMCF model, user review based explanation and evaluation could be a potential future extension to our model.
Here we develop novel quantitative evaluation schemes to assess our model’s explanation quality in terms of general preferences and specific preferences, respectively.
4.3.1 General Preference
Let us denote the ground truth of user general preferences as $p_i$ for user $i$, and the model's predicted preference for user $i$ as $\hat{p}_i$. We propose two measures inspired by Recall@K in recommendation evaluations.
Top $M$ recall at $K$ (TM@K): Given the $M$ most preferred aspects of a user from $p_i$, top $M$ recall at $K$ is defined as the ratio of those $M$ aspects that are located among the $K$ highest valued aspects in $\hat{p}_i$. For example, if $p_i$ indicates that user $i$'s top 3 preferred aspects are Adventure, Drama, and Thriller, while the predicted $\hat{p}_i$ shows that the top 5 are Adventure, Comedy, Children, Drama, Crime, the top 3 recall at 5 (T3@5) is then $2/3$, whereas the top 1 recall at 3 (T1@3) is $1$.
Bottom $M$ recall at $K$ (BM@K): Defined similarly to the above, except that it measures the $M$ most disliked aspects.
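The two recall measures can be sketched directly from their definitions. In the example below, the aspect names come from the text's example, while the numeric preference scores are hypothetical values chosen only to reproduce the stated rankings; the function names are also illustrative.

```python
def ranked_indices(scores, k, highest=True):
    """Indices of the k highest (or lowest) scoring aspects."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=highest)
    return set(order[:k])

def top_m_recall_at_k(truth, predicted, m, k):
    """Share of the true top-m aspects found among the predicted top-k aspects."""
    hits = ranked_indices(truth, m) & ranked_indices(predicted, k)
    return len(hits) / m

def bottom_m_recall_at_k(truth, predicted, m, k):
    """Share of the true bottom-m aspects found among the predicted bottom-k aspects."""
    hits = ranked_indices(truth, m, highest=False) & ranked_indices(
        predicted, k, highest=False
    )
    return len(hits) / m

aspects = ["Adventure", "Comedy", "Children", "Drama", "Crime", "Thriller"]
# True ranking starts Adventure, Drama, Thriller.
truth = [0.9, 0.1, 0.0, 0.8, 0.2, 0.7]
# Predicted ranking starts Adventure, Comedy, Children, Drama, Crime.
predicted = [0.9, 0.8, 0.7, 0.6, 0.5, 0.1]

t3_at_5 = top_m_recall_at_k(truth, predicted, m=3, k=5)
t1_at_3 = top_m_recall_at_k(truth, predicted, m=1, k=3)
```

With these scores, Adventure and Drama are recovered within the predicted top 5 but Thriller is not, which gives the T3@5 value of 2/3 from the example.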
As the ground truth of user preferences is usually not available, some reasonable approximations are needed. Hence we propose a method to calculate the so-called surrogate ground truth. First we define the weights $w_{ij} = (r_{ij} - b^u_i - b^v_j - \bar{r}) / A$, where the weight is calculated by nullifying the user bias $b^u_i$, the item bias $b^v_j$, and the global average $\bar{r}$, and $A$ is a constant indicating the maximum rating (e.g. $A = 5$ for most datasets). Note that the user bias and item bias can be easily calculated by $b^u_i = \frac{1}{|V_i|} \sum_{j \in V_i} (r_{ij} - \bar{r})$ and $b^v_j = \frac{1}{|U_j|} \sum_{i \in U_j} (r_{ij} - \bar{r})$. Here $V_i$ represents the set of items rated by user $i$, and $U_j$ represents the set of users that have rated item $j$. With the weights we calculate user $i$'s preference on aspect $t$ using the following formula: $p_i^t = \sum_{j \in V_i} w_{ij} s_j^t$, where $s_j^t = 1$ if item $j$ has aspect $t$, and $0$ otherwise. Hence a user $i$'s overall preference can be represented by a normalized vector $p_i = (p_i^1, \dots, p_i^m)$. As our model can output a user preference vector directly, we evaluate the explainability by calculating the average of TM@K and BM@K. The evaluation results are reported in Table 3. We observe that the explainability of AMCF is significantly better than random interpretation, and is comparable to that of the strong interpretable LR baseline while delivering much better prediction accuracy. Thus our AMCF model successfully integrates the strong prediction performance of a latent factor model with the strong interpretability of an LR model.
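The surrogate ground truth computation can be sketched as below. Several details are assumptions rather than facts from the paper: the biases are taken as mean deviations from the global average, the preference vector is normalized with the Euclidean norm, and the function name, data layout, and toy ratings are hypothetical.

```python
import math

def surrogate_preferences(ratings, item_aspects, num_aspects, max_rating=5.0):
    """ratings: {(user, item): rating}; item_aspects: {item: multi-hot list}."""
    mean = sum(ratings.values()) / len(ratings)
    by_user, by_item = {}, {}
    for (user, item), r in ratings.items():
        by_user.setdefault(user, []).append(r - mean)
        by_item.setdefault(item, []).append(r - mean)
    # Assumed bias definition: average deviation from the global mean.
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}
    item_bias = {i: sum(d) / len(d) for i, d in by_item.items()}

    prefs = {u: [0.0] * num_aspects for u in by_user}
    for (user, item), r in ratings.items():
        # Debiased rating, scaled by the maximum rating.
        w = (r - user_bias[user] - item_bias[item] - mean) / max_rating
        for t, has_aspect in enumerate(item_aspects[item]):
            prefs[user][t] += w * has_aspect

    # Assumed normalization: unit Euclidean length (skipped for zero vectors).
    for user, p in prefs.items():
        norm = math.sqrt(sum(x * x for x in p))
        if norm > 0:
            prefs[user] = [x / norm for x in p]
    return prefs

ratings = {("u1", "a"): 5.0, ("u1", "b"): 1.0, ("u2", "a"): 4.0, ("u2", "b"): 4.0}
item_aspects = {"a": [1, 0], "b": [0, 1]}
prefs = surrogate_preferences(ratings, item_aspects, num_aspects=2)
```

In this toy data, user `u1` rates item `a` well above and item `b` well below what the biases would predict, so the surrogate preference is positive for the first aspect and negative for the second.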
4.3.2 Specific Preference
Our approach is also capable of predicting a user's preference on a specific item, i.e. $\hat{p}_{ij}$, showing which aspects of item $j$ are liked/disliked by user $i$. Compared to the user's general preference across all items, the problem of which aspect of an item attracts the user most (specific preference) is more interesting and more challenging. There is no widely accepted strategy to evaluate the quality of single item preference prediction (except for direct customer surveys). Here we propose a simple yet effective evaluation scheme to illustrate the quality of our model's explanation of user specific preference. With the overall preference $p_i$ of user $i$ given above, and assuming $s_j$ is the multi-hot vector that represents the aspects of item $j$, we say that the element-wise product $p_i \odot s_j$ reflects the user's specific preference on item $j$.
Note that we should not use the TM@K/BM@K scheme as in the general preference evaluation, because the entries of both $p_i \odot s_j$ and the predicted $\hat{p}_{ij}$ are mostly zeros, since each movie is only categorized into a few genres. Hence the quality of specific preference prediction is expressed using a similarity measure. We use $\cos(p_i \odot s_j, \hat{p}_{ij})$ to represent the cosine similarity between $p_i \odot s_j$ and $\hat{p}_{ij}$, and the score for specific preference prediction is defined by averaging over all $N$ user-item pairs in the test set: $\frac{1}{N} \sum_{(i,j)} \cos(p_i \odot s_j, \hat{p}_{ij})$. We report the results of specific user preferences in the corresponding column of Table 3. As the LR model cannot give specific user preferences directly, we simply apply $\hat{p}_{ij} = \hat{p}_i \odot s_j$, where $\hat{p}_i$ represents the general preference predicted by LR.
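The specific preference score can be sketched as an average cosine similarity. The function names and toy vectors below are hypothetical, and returning a similarity of zero when either vector is all zeros is an added convention, not something stated in the text.

```python
import math

def cosine(x, y):
    """Cosine similarity; defined as 0.0 here when either vector is all zeros."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)

def specific_score(pairs):
    """pairs: (general preference, item multi-hot vector, predicted specific preference)."""
    sims = []
    for general_pref, item_aspects, predicted in pairs:
        # Element-wise product keeps only the aspects the item actually has.
        target = [p * s for p, s in zip(general_pref, item_aspects)]
        sims.append(cosine(target, predicted))
    return sum(sims) / len(sims)

pairs = [
    ([0.6, -0.8, 0.0], [1, 1, 0], [0.3, -0.4, 0.0]),  # prediction agrees
    ([1.0, 0.0, 0.0], [1, 0, 0], [-1.0, 0.0, 0.0]),   # prediction is opposite
]
score = specific_score(pairs)
```

The first pair contributes a similarity close to 1 and the second a similarity of -1, so the averaged score for this toy test set is approximately zero.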
An insight: Assume that for a specific user $i$, our AMCF model can be simply written as $\hat{r}_{ij} = f_i(u_j)$ to predict the rating for item $j$, where $u_j$ is the item embedding. Note that our AMCF model can decompose the item in terms of aspects. Let us denote the embeddings of these aspects as $v_1, \dots, v_m$. Then the prediction can be approximated by $\hat{r}_{ij} \approx f_i\left(\sum_{t=1}^{m} a_{j,t} v_t\right)$, where $a_{j,t}$ denotes the $t$-th attention weight for item $j$. In the case of LR, the rating is obtained by $\hat{r}_{ij} = g_i\left(\sum_{t=1}^{m} c_{i,t} x_{j,t}\right)$, where $g_i$ is the LR model for user $i$, $c_{i,t}$ is its $t$-th coefficient, and $x_{j,t}$ represents the indicator of aspect $t$: $x_{j,t} = 1$ when item $j$ has aspect $t$, and $x_{j,t} = 0$ otherwise. The similarity between the AMCF formula and the LR formula above indicates that the coefficients of LR and the preference output of AMCF share the same intrinsic meaning, i.e., both indicate the importance of aspects.
An example: For specific explanation, given a user $i$ and an item $j$, our AMCF model predicts a vector $\hat{p}_{ij}$, representing user $i$'s specific preference on item $j$ in terms of all predefined aspects. Specifically, the magnitude of each entry of $\hat{p}_{ij}$ (i.e. $|\hat{p}_{ij}^t|$) represents the impact of a specific aspect on whether an item is liked by a user or not. For example, in Figure 4, the same movie is rated highly by two users, but with different explanations: the former user's preference is more on the Action genre, whereas the latter's is more on Sci-Fi and War. On the other hand, the same movie is rated low by a third user, mainly due to that user's dislike of the Action genre.
Modelers tend to better appreciate the interpretable recommender systems whereas users are more likely to accept the explainable recommendations. In this paper, we proposed a novel interpretable feature mapping strategy attempting to achieve both goals: systems interpretability and recommendation explainability. Using extensive experiments and tailor-made evaluation schemes, our AMCF method demonstrates strong performance in both recommendation and explanation.
This work is supported by the National Science Foundation under grant no. IIS-1724227.
- [1] (2016) Explainable matrix factorization for collaborative filtering. In Proceedings of the 25th WWW, pp. 5–6.
- [2] (2018) ReEL: review aware explanation of location recommendation. In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 23–32.
- [3] (2017) Aspect based recommendations: recommending items with the most valuable aspects based on user reviews. In Proceedings of the 23rd ACM SIGKDD, pp. 717–725.
- [4] (2018) Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 WWW, pp. 1583–1592.
- [5] (2016) Learning to rank features for recommendation over multiple categories. In Proceedings of the 39th International ACM SIGIR, pp. 305–314.
- [6] Dynamic explainable recommendation based on neural attentive models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 53–60.
- [7] (2019) Co-attentive multi-task learning for explainable recommendation. In IJCAI.
- [8] (2018) Automatic generation of natural language explanations. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, pp. 57.
- [9] (2014) Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD, pp. 193–202.
- [10] (2019) Explainable recommendation through attentive multi-view learning. In AAAI Conference on Artificial Intelligence (AAAI).
- [11] (2016) The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4), pp. 19.
- [12] (2015) TriRank: review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1661–1670.
- [13] (2017) Neural collaborative filtering. In Proceedings of the 26th WWW, pp. 173–182.
- [14] (2019) Explainable recommendation with fusion of aspect information. WWW 22 (1), pp. 221–240.
- [15] (2009) Matrix factorization techniques for recommender systems. Computer 8, pp. 30–37.
- [16] (2018) Explainable movie recommendation systems by using story-based similarity. In IUI Workshops.
- [17] (2018) Why I like it: multi-task learning for recommendation and explanation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 4–12.
- [18] (2018) Explanation mining: post hoc interpretability of latent factor models for recommendation systems. In Proceedings of the 24th ACM SIGKDD, pp. 2060–2069.
- [19] (2010) Factorization machines. In 2010 IEEE International Conference on Data Mining, pp. 995–1000.
- [20] Why should I trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD, pp. 1135–1144.
- [21] (2019) Explainable reasoning over knowledge graphs for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5329–5336.
- [22] (2019) A context-aware user-item representation learning for item recommendation. ACM Transactions on Information Systems (TOIS) 37 (2), pp. 22.
- [23] (2018) Explainable recommendation: a survey and new perspectives. arXiv preprint arXiv:1804.11192.
- [24] Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR, pp. 83–92.
- [25] (2019) WWW’19 tutorial on explainable recommendation and search. In Companion Proceedings of WWW, pp. 1330–1331.