Online retail is a growing market with sales accounting for $394.9 billion or 11.7% of total US retail sales in 2016. In the same year, e-commerce sales accounted for 41.6 percent of all retail sales growth. For some entertainment products such as movies, books, and music, online retailers have long outperformed traditional in-store retailers. One of the driving forces of this success is the ability of online retailers to collect purchase histories of customers, online shopping behavior, and reviews of products for a very large number of users. This data is driving several machine learning applications in online retail, of which personalized recommendation is the most important one. With recommender systems online retailers can provide personalized product recommendations and anticipate purchasing behavior.
In addition, the availability of product reviews allows users to make more informed purchasing choices and companies to analyze customer sentiment towards their products. The latter is known as sentiment analysis and is concerned with machine learning approaches that map written text to scores. Nevertheless, even the best sentiment analysis methods cannot help in determining which new products a customer might be interested in. The obvious reason is that customer reviews are not available for products they have not purchased yet.
In recent years the availability of large corpora of product reviews has driven text-based research in the recommender system community (e.g. [21, 19, 3]). Some of these novel methods extend latent factor models to leverage review text by employing an explicit mapping from text to either user or item factors. At prediction time, these models predict product ratings based on some operation (typically the dot product) applied to the user and product representations. Sentiment analysis, however, is usually applied to some representation (e.g. bag-of-words) of review text but in a recommender system scenario the review is not available at prediction time.
With this paper we propose TransRev, a method that combines a personalized recommendation learning objective with a sentiment analysis objective into a joint learning objective. TransRev learns vector representations for users, items, and reviews jointly. The crucial advantage of TransRev is that the review embedding is learned such that it corresponds to a translation that moves the embedding of the reviewing user to the embedding of the item the review is about. This allows TransRev to approximate a review embedding at test time as the difference of the item and user embedding despite the absence of a review from the user for that item. The approximated review embedding is then used in the sentiment analysis model to predict the review score. Moreover, the approximated review embedding can be used to retrieve reviews in the training set deemed most similar by a distance measure in the embedding space. These retrieved reviews could be used for several purposes. For instance, such reviews could be provided to users as a starting point for a review, lowering the barrier to writing reviews.
We performed an extensive set of experiments to evaluate the performance of TransRev on standard recommender system data sets. TransRev outperforms state of the art methods on 15 of the 19 data sets. Moreover, we qualitatively compare actual reviews with the retrieved ones by TransRev based on a similarity metric in the review embedding space. Finally, we discuss some weaknesses of TransRev and possible future research directions.
2 TransRev: Modeling Reviews as Translations in Vector Space
We address the problem of learning prediction models for the product recommendation problem. There are a set of users $\mathcal{U}$, a set of items $\mathcal{I}$, and a set of reviews $\mathcal{R}$. Each $rev_{(u,i)} \in \mathcal{R}$ represents a review written by user $u$ for item $i$. Hence, $rev_{(u,i)} = [t_1, \dots, t_n]$, that is, each review is a sequence of $n$ tokens. In the following we refer to $(u, rev_{(u,i)}, i)$ as a triple. Each such triple is associated with the review score $r_{(u,i)}$ given by the user $u$ to item $i$.
TransRev embeds all users, items and reviews into a latent space where the embedding of a user plus the embedding of the review is learned to be close to the embedding of the reviewed item. It simultaneously learns a regression model to predict the rating given a review text. At prediction time, reviews are not available, but the modeling assumption of TransRev allows us to predict the review embedding by taking the difference between the embeddings of the item and the user. This approximation is then used as the input feature of the regression model to perform rating prediction.
TransRev embeds all nodes and reviews into a latent space $\mathbb{R}^k$ ($k$ is a model hyperparameter). The review embeddings are computed by applying a learnable function $f$ to the token sequence of the review:
$$\mathbf{h}_{rev_{(u,i)}} = f(rev_{(u,i)}).$$
The function $f$ can be parameterized (typically with a neural network such as a recursive or convolutional neural network) but it can also be a simple parameter-free aggregation function that computes, for instance, the element-wise average or maximum of the token embeddings.
We propose and evaluate a simple instance of $f$ where the review embedding $\mathbf{h}_{rev_{(u,i)}}$ is the average of the embeddings of the tokens occurring in the review. More formally,
$$\mathbf{h}_{rev_{(u,i)}} = f(rev_{(u,i)}) = \frac{1}{|rev_{(u,i)}|} \sum_{t \in rev_{(u,i)}} \mathbf{v}_t + \mathbf{h}_0, \tag{1}$$
where $\mathbf{v}_t$ is the embedding associated with token $t$ and $\mathbf{h}_0$ is a review bias which is common to all reviews and takes values in $\mathbb{R}^k$. The review bias is of importance since there are some reviews none of whose tokens are in the training vocabulary. In these cases we have $\mathbf{h}_{rev_{(u,i)}} = \mathbf{h}_0$.
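The averaging function can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: vectors are plain lists, the names `review_embedding`, `token_vectors` and `review_bias` are hypothetical, and tokens outside the training vocabulary are assumed to be skipped before averaging.

```python
def review_embedding(tokens, token_vectors, review_bias):
    """Average the embeddings of the known tokens and add the shared review bias."""
    k = len(review_bias)
    # Assumption: tokens without an embedding are ignored.
    known = [token_vectors[t] for t in tokens if t in token_vectors]
    if not known:
        # No token is in the vocabulary: the embedding is the bias alone.
        return list(review_bias)
    mean = [sum(vec[d] for vec in known) / len(known) for d in range(k)]
    return [m + b for m, b in zip(mean, review_bias)]


# Toy 2-dimensional embeddings, chosen only for illustration.
toy_vectors = {"great": [1.0, 0.0], "cd": [0.0, 1.0]}
toy_bias = [0.1, 0.1]
print(review_embedding(["great", "cd"], toy_vectors, toy_bias))
print(review_embedding(["unseen"], toy_vectors, toy_bias))
```

For a review with no known tokens the function returns the bias alone, which mirrors the fallback described above.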
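The learning of the item, review, and user embeddings is determined by two learning objectives. The first objective guides the joint learning of the parameters of the regression model and the review embeddings such that the regression model performs well at review score prediction
$$\min_{\Theta} \mathcal{L}_1 = \min_{\Theta} \sum_{((u, rev_{(u,i)}, i),\, r_{(u,i)}) \in S} \left( g(\mathbf{h}_{rev_{(u,i)}}) - r_{(u,i)} \right)^2, \tag{2}$$
where $S$ is the set of training triples and their associated ratings, and $g$ is a learnable regression function that is applied to the representation of the review $\mathbf{h}_{rev_{(u,i)}}$.
While $g$ can be an arbitrarily complex function, the instance of $g$ used in this work is as follows
$$g(\mathbf{h}_{rev_{(u,i)}}) = \sigma(\mathbf{h}_{rev_{(u,i)}})\, \mathbf{w}^{\top} + b_{(u,i)}, \tag{3}$$
where $\mathbf{w}$ are the learnable weights of the linear regressor, $\sigma$ is the sigmoid function, and $b_{(u,i)}$ is the shortcut we use to refer to the sum of the bias terms, namely the user, item and overall bias: $b_{(u,i)} = b_u + b_i + b_0$.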
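The regression step can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the original code: it assumes a squared-error loss, uses plain lists as vectors, and the names `predict_rating` and `regression_loss` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_rating(h_rev, w, b_user, b_item, b_global):
    """Linear regressor on the sigmoid-transformed review embedding plus bias terms."""
    activated = [sigmoid(x) for x in h_rev]
    return sum(a * wi for a, wi in zip(activated, w)) + b_user + b_item + b_global


def regression_loss(examples, w, user_bias, item_bias, b_global):
    """Sum of squared errors over (user, item, review embedding, rating) examples."""
    total = 0.0
    for user, item, h_rev, rating in examples:
        pred = predict_rating(h_rev, w, user_bias[user], item_bias[item], b_global)
        total += (pred - rating) ** 2
    return total


# Toy example: a zero embedding gives sigmoid values of 0.5 in every dimension.
examples = [("u1", "i1", [0.0, 0.0], 5.0)]
print(regression_loss(examples, [2.0, 2.0], {"u1": 0.1}, {"i1": 0.2}, 3.0))
```

The bias terms are kept as separate arguments so that the user, item and overall contributions stay visible.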
Of course, in a real-world scenario a recommender system makes rating predictions on items that users have not rated yet and, consequently, reviews are not available for those items. The application of the linear regressor of Equation (2) to new examples, therefore, is not possible at test time. Our second learning procedure aims at overcoming this limitation by leveraging ideas from embedding-based knowledge base completion methods. We want to be able to approximate a review embedding at test time such that this review embedding can be used in conjunction with the learned regression model. Hence, in addition to the learning objective (2), we introduce a second objective that forces the embedding of a review to be close to the difference between the item and user embeddings. This translation-based modeling assumption is followed in TransE [5] and several other knowledge base completion methods [14, 11]. We include a second term in the objective that drives the distance between (a) the user embedding translated by the review embedding and (b) the embedding of the item to be small
$$\min_{\Theta} \mathcal{L}_2 = \min_{\Theta} \sum_{((u, rev_{(u,i)}, i),\, r_{(u,i)}) \in S} \left\| \mathbf{e}_u + \mathbf{h}_{rev_{(u,i)}} - \mathbf{e}_i \right\|_2, \tag{4}$$
where $\mathbf{e}_u$ and $\mathbf{e}_i$ are the embeddings of the user and item, respectively. In the knowledge base embedding literature (cf. [5]) it is common that the representations are learned via a margin-based loss, where the embeddings are updated if the score (the negative distance) of a positive triple is not larger than the score of a negative triple plus a margin. Note that this type of learning is required to avoid trivial solutions. The minimization problem of Equation (4) can easily be solved by setting $\mathbf{e}_u = \mathbf{h}_{rev_{(u,i)}} = \mathbf{e}_i = \mathbf{0}$. However, this kind of trivial solution is avoided by jointly optimizing Equations (2) and (4), since a degenerate solution like the aforementioned one would lead to a high error with respect to the regression objective (Equation (2)). The overall objective can now be written as
$$\min_{\Theta} \mathcal{L} = \min_{\Theta} \left( \mathcal{L}_1 + \lambda \mathcal{L}_2 + \mu \|\Theta\|_2 \right), \tag{5}$$
where $\lambda$ is a term that weights the approximation loss due to the modeling assumption formalized in Equation (4). In our model, $\Theta$ corresponds to the parameters $\mathbf{w}$, $\mathbf{e}$, $\mathbf{v}$, $\mathbf{h}_0$ and the bias terms $\mathbf{b}$.
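The interplay of the two objectives can be sketched as follows. This is a minimal illustration, assuming a Euclidean distance for the translation term and an L2 norm over a flat parameter list for the regularizer; the names `translation_loss`, `joint_objective`, `lam` (weight of the translation term) and `mu` (regularization strength) are hypothetical.

```python
import math


def translation_loss(e_user, h_rev, e_item):
    """Euclidean distance between the translated user embedding and the item embedding."""
    return math.sqrt(sum((u + h - i) ** 2 for u, h, i in zip(e_user, h_rev, e_item)))


def joint_objective(regression_losses, translation_losses, parameters, lam, mu):
    """Regression loss + lam * translation loss + mu * L2 norm of the parameters."""
    penalty = math.sqrt(sum(p ** 2 for p in parameters))
    return sum(regression_losses) + lam * sum(translation_losses) + mu * penalty


# The degenerate all-zero solution makes the translation loss vanish ...
print(translation_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
# ... but it cannot fit the ratings, so the regression term keeps the joint objective high.
print(joint_objective([16.0], [0.0], [0.0, 0.0], lam=0.5, mu=0.1))
```

The second call illustrates why the trivial solution is unattractive under the joint objective: the translation term is zero while the regression term stays large.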
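At test time, we can now approximate review embeddings of $(u, i)$ pairs not seen during training by computing
$$\hat{\mathbf{h}}_{rev_{(u,i)}} = \mathbf{e}_i - \mathbf{e}_u. \tag{6}$$
With the trained regression model $g$ we can make rating predictions $\hat{r}_{(u,i)}$ for unseen $(u, i)$ pairs by computing
$$\hat{r}_{(u,i)} = g(\hat{\mathbf{h}}_{rev_{(u,i)}}). \tag{7}$$
Contrary to training, now the regression model $g$ is applied over $\hat{\mathbf{h}}_{rev_{(u,i)}}$, instead of $\mathbf{h}_{rev_{(u,i)}}$, which is not available at test time.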
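The test-time procedure can be sketched in Python as below. It is an illustration under assumptions: the regressor is taken to be the sigmoid-plus-linear form described earlier, the three bias terms are collapsed into one number, and the function names are hypothetical.

```python
import math


def approximate_review_embedding(e_user, e_item):
    """Approximate the missing review embedding as item embedding minus user embedding."""
    return [i - u for u, i in zip(e_user, e_item)]


def predict_unseen_rating(e_user, e_item, w, bias):
    """Apply the trained regressor to the approximated review embedding."""
    h_hat = approximate_review_embedding(e_user, e_item)
    activated = [1.0 / (1.0 + math.exp(-h)) for h in h_hat]
    return sum(a * wi for a, wi in zip(activated, w)) + bias


# Toy values: identical user and item embeddings give a zero review embedding.
print(predict_unseen_rating([1.0, 2.0], [1.0, 2.0], [2.0, 4.0], 1.0))
```

No review text enters this computation; only the learned user and item embeddings and the regressor parameters are needed.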
3 Related Work
There are three lines of research related to our work: recommender systems, sentiment analysis, and multi-relational graph completion. There is an extensive body of work on recommender systems [1, 6, 26, 27, 7, 30, 13, 10]. Singular value decomposition (SVD) [17] computes the review score prediction as the dot product between the item embeddings and the user embeddings plus some learnable bias terms. Due to its simplicity and performance on numerous data sets it is still one of the most used methods for product recommendations. Even though there has been a flurry of research on predicting ratings from the interaction of latent representations of users and items, there is not much work on incorporating review text despite its availability in several corpora. [16] was one of the first approaches that demonstrated that features extracted from review text are useful in learned models to improve the accuracy of rating predictions. Most of the previous research that explored the utility of review text for rating prediction can be classified into two categories.
Semi-supervised approaches. HFT [21] was one of the first methods combining a supervised learning objective to predict ratings with an unsupervised learning objective (e.g. latent Dirichlet allocation) for text content to regularize the parameters of the supervised model. The idea of combining two learning objectives has been explored in several additional approaches [19, 3, 9, 2]. The methods differ in the unsupervised objectives, some of which are tailored to a specific domain. For example, JMARS [9] outperforms HFT on a movie recommendation data set but it is outperformed by HFT on data sets similar to those used in our work.
Supervised approaches. Methods that fall into this category such as [29, 33, 8] learn latent representations of users and items from the text content so as to perform well at rating prediction. The learning of the latent representations is done via a deep architecture. The approaches differ mainly in the neural architectures they employ.
There is one crucial difference between the aforementioned methods and TransRev. TransRev predicts the review score based on an approximation of the review embedding computed at test time. Moreover, since TransRev is able to approximate a review embedding, we can use this embedding to retrieve reviews in the training set deemed most similar by a distance metric in the embedding space.
Similar to sentiment analysis methods, TransRev trains a regression model that predicts the review rating from the review text. Contrary to the typical setting in which sentiment analysis methods operate, however, review text is not available at prediction time in the recommender system setting. Consequently, the application of sentiment analysis for recommender systems is not directly possible. In the simplest case, a sentiment analysis method is a linear regressor applied to a text embedding (Equation (3)). TransRev trains such a regression model to perform well in conjunction with the approximated review embedding.
The third research theme related to TransRev is knowledge base completion. In recent years, many embedding-based methods have been proposed to infer missing relations in knowledge bases based on a function that computes a likelihood score from the embeddings of entities and relation types. Due to its simplicity and good performance, there is a large body of work on translation-based scoring functions [5, 14, 11]. [15] propose an approach to large-scale sequential sales prediction that embeds items into a transition space where user embeddings are modeled as translation vectors operating on item sequences. The associated optimization problem is formulated as a sequential Bayesian ranking problem. To the best of our knowledge, [15] is the first work to leverage ideas from knowledge base completion methods for recommender systems. Whereas TransRev addresses the problem of rating prediction by incorporating review text, [15] addresses the different problem of sequential recommendation. Therefore, an experimental comparison to that work is not possible. In TransRev the review embedding translates the user embedding to the product embedding. In [15], the user embedding translates a product embedding to the embedding of the next purchased product. TransRev is also novel in that the approximated review embeddings can be used to retrieve, from an existing training set, the reviews deemed most similar by a distance metric in the embedding space.
4 Experimental Setup
We conduct several experiments to empirically compare TransRev to state of the art methods for product recommendation. More specifically, we compare TransRev to competitive matrix factorization methods as well as methods that take advantage of review text. Moreover, we provide some qualitative results on retrieving training reviews most similar to the approximated reviews at test time.
4.1 Data Sets
We evaluate the various methods on two commonly used data sets. The Yelp Business Rating Prediction Challenge (https://www.kaggle.com/c/yelp-recsys-2013) data set consists of reviews on restaurants in Phoenix (United States). The Amazon Product Data (http://jmcauley.ucsd.edu/data/amazon) has been extensively used in previous works [21, 22, 23]. The data set consists of reviews and product metadata from Amazon from May 1996 to July 2014. We focus on the 5-core versions (which contain at least 5 reviews for each user and item) of those data sets. There are 24 product categories from which we have selected those 12 used in , plus 6 randomly picked categories out of the 12 remaining ones. We treat each of these resulting 18 data sets independently in our experiments. Ratings in both benchmark data sets are integer values between 1 and 5. As in previous work, we randomly sample 80% of the reviews as training, 10% as validation, and 10% as test data. We remove reviews from the validation and test splits if they involve either a product or a user that is not part of the training data.
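The splitting and filtering procedure can be sketched as follows. This is a minimal sketch, assuming reviews are given as `(user, item, rating)` tuples; the function name `split_reviews` and the `seed` parameter are hypothetical.

```python
import random


def split_reviews(reviews, seed=0):
    """Randomly split reviews 80/10/10 and drop validation/test reviews
    whose user or item does not occur in the training split."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    train_users = {u for u, _, _ in train}
    train_items = {i for _, i, _ in train}

    def seen(part):
        return [r for r in part if r[0] in train_users and r[1] in train_items]

    return train, seen(valid), seen(test)


toy_reviews = [(u, i, 5) for u in range(10) for i in range(10)]
train, valid, test = split_reviews(toy_reviews)
print(len(train), len(valid), len(test))
```

The filter is applied after the split, so the validation and test sets can end up slightly smaller than 10% of the data.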
4.2 Review Text Preprocessing
We follow the same preprocessing steps for each data set. First, we lowercase the review texts and apply the regular expression “” to tokenize the text data, discarding those words that appear in less than 0.1% of the reviews of the data set under consideration. For the Amazon data sets, both full reviews and short summaries (rarely having more than 30 words) are available. Since classifying short documents into their sentiment is less challenging than doing the same for longer text [4], we have used the review summaries for our work. For the Yelp data only full reviews are available. We truncate these reviews to the first 200 words. Some statistics of the preprocessed data sets are summarized in Table 1.
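The preprocessing steps can be sketched in Python as below. The tokenization pattern `\w+` is an assumption, since the regular expression is not shown in the text; the names `tokenize`, `build_vocabulary`, `max_tokens` and `min_doc_fraction` are likewise hypothetical.

```python
import re
from collections import Counter

# Assumption: a simple word pattern stands in for the unspecified regular expression.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text, max_tokens=None):
    """Lowercase the text, split it into tokens, and optionally truncate."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return tokens if max_tokens is None else tokens[:max_tokens]


def build_vocabulary(token_lists, min_doc_fraction):
    """Keep the words that occur in at least the given fraction of reviews."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each review once per word
    threshold = min_doc_fraction * len(token_lists)
    return {word for word, count in doc_freq.items() if count >= threshold}


reviews = ["Great CD, great!", "Great sound"]
token_lists = [tokenize(r, max_tokens=200) for r in reviews]
print(build_vocabulary(token_lists, min_doc_fraction=0.6))
```

Document frequency is counted per review rather than per occurrence, which matches the wording "appear in ... of the reviews".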
|Cds and Vinyl||75,259||64,444||576||813,897||101,600||101,581|
|Amazon Instant Video||1.180||0.936||0.946||0.904||0.888||0.943||0.884|
|Cds and Vinyl||1.127||0.866||0.871||0.863||0.854||0.888||0.854|
|Grocery and Gourmet Food||1.165||1.004||0.985||0.964||0.961||0.973||0.957|
|Health and Personal Care||1.200||1.054||1.048||1.016||1.014||1.081||1.011|
|Patio, Lawn and Garden||1.156||0.999||0.958||0.950||0.956||1.070||0.941|
|Tools and Home Improvement||1.017||0.938||0.908||0.884||0.884||0.946||0.879|
|Toys and Games||0.975||-||0.821||0.788||0.784||0.851||0.784|
|Sports and Outdoors||0.931||-||0.856||0.828||0.824||0.882||0.823|
|Cell Phones and Accesories||1.455||-||1.357||1.290||1.285||1.365||1.279|
4.3 Baselines
We compare to the matrix factorization-based methods SVD and NMF (non-negative matrix factorization) as well as approaches that leverage review text for rating prediction in a semi-supervised manner like HFT, and in a supervised manner such as Attn+CNN [29, 28] and DeepCoNN [33]. We also compare to a simple baseline Offset that simply uses the average rating in the training set as the prediction.
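The Offset baseline is simple enough to state in a few lines. A minimal sketch, with hypothetical function names:

```python
def fit_offset(train_ratings):
    """Return the average rating of the training set."""
    return sum(train_ratings) / len(train_ratings)


def predict_offset(offset, n_examples):
    """Predict the same average rating for every example."""
    return [offset] * n_examples


offset = fit_offset([5, 4, 3, 5, 3])
print(predict_offset(offset, 2))
```

Because the prediction ignores both the user and the item, this baseline gives a reference point for how much the other methods gain from personalization.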
4.4 Parameter Setting
We set the dimension of the embedding space to $k = 16$ for all methods. We evaluated the robustness of TransRev to changes in the hyper-parameter $k$ but did not observe any significant performance difference. This is in line with previous work on the Yelp and Amazon data sets that observed that HFT and SVD did not show any improvements for . For SVD and NMF we used the Python package SurPRISE (https://pypi.python.org/pypi/scikit-surprise), whose optimization is performed by vanilla stochastic gradient descent, and chose the learning rate and regularization term on the validation set from the values and . For HFT we used the original implementation of the authors (http://cseweb.ucsd.edu/~jmcauley/code/codeRecSys13.tar.gz) and validated the regularization term from the values . For TransRev we validated among the values and the learning rate of the optimizer and regularization term ( in our model) from the same set of values as for SVD and NMF. To ensure a fair comparison with SVD and NMF, we also use vanilla SGD to optimize TransRev. TransRev’s parameters were randomly initialized [12]. Parameters for HFT were learned with L-BFGS which was run for 2,500 learning iterations and validated every 50 iterations.
A single learning iteration performs SGD with all review triples in the training data and their associated ratings. For TransRev we used a batch size of 64. We ran SVD, NMF and TransRev for a maximum of 500 epochs and validated every epochs. All methods are validated according to the Mean Squared Error (MSE)
$$\mathrm{MSE} = \frac{1}{|\mathcal{W}|} \sum_{(u,i) \in \mathcal{W}} \left( r_{(u,i)} - \hat{r}_{(u,i)} \right)^2,$$
where $\mathcal{W}$ is either the validation or test set. The implementation of Attn+CNN is not publicly available, so we directly copied the MSE from  where the training, validation, and test data sets have the same proportions (80%, 10%, 10%). For DeepCoNN the original author code is not available and we used a third-party implementation (https://github.com/chenchongthu/DeepCoNN). We applied the default hyperparameter values for dropout and L2 regularization and used the same embedding dimension as for all other methods.
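The evaluation metric can be written down directly. A minimal sketch, with a hypothetical function name and toy ratings:

```python
def mean_squared_error(true_ratings, predicted_ratings):
    """Average squared difference between true and predicted ratings."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("both lists must have the same length")
    squared = [(r - p) ** 2 for r, p in zip(true_ratings, predicted_ratings)]
    return sum(squared) / len(squared)


print(mean_squared_error([5, 3, 1], [4, 3, 2]))
```

The same function would be applied to the validation split for model selection and to the test split for reporting.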
We randomly selected the 4 data sets Baby, Digital Music, Office and Tools and Home Improvement from the Amazon data and evaluated different values of $k$ for user, item and word embedding sizes. We increase $k$ from 4 to 64 and list the MSE scores in Table 3. We only observe insignificant differences in the corresponding models’ performances. This observation is in line with .
The experimental results are listed in Table 2 where the best performance is in bold font. TransRev achieves the best performance on 17 out of the 19 data sets. In line with previous work [16, 21], both TransRev and HFT outperform methods that do not take advantage of review text. TransRev is competitive with and often outperforms HFT on the benchmark data sets under consideration. To verify that the rating predictions made by HFT and TransRev are significantly different we have computed the dependent t-test for paired samples, and for all data sets where TransRev outperforms HFT the p-value is smaller than 0.01.
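The statistic behind a dependent t-test for paired samples can be sketched as follows. This is an illustration only: it assumes the paired values are per-example quantities from the two models (for instance predictions on the same test reviews), the function name is hypothetical, and the p-value itself would additionally require the Student t distribution with n - 1 degrees of freedom.

```python
import math


def paired_t_statistic(values_a, values_b):
    """t statistic of the paired differences between two equally long samples."""
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired observations are required")
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)


# Toy paired values, chosen only for illustration.
print(paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 1.0, 3.0]))
```

In practice a statistics library routine such as `scipy.stats.ttest_rel` returns both the statistic and the p-value.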
We only copied the numbers of Attn+CNN from  since an implementation is not available. This could lead to differences in the results due to the different randomly sampled training, validation, and test sets. However, in addition to the results in this paper, Attn+CNN was compared to some of the baselines in related work . The authors there showed that Attn+CNN performs worse than either SVD or HFT or both in 10 of 12 Amazon data sets. At the same time, in our experiments, TransRev performs better than HFT and SVD on the same data sets with the exception of the Kindle Store category.
|Actual test review||Closest training review in embedding space|
|skin improved (5)||makes your face feel refreshed (5)|
|love it (5)||you’ll notice the difference (5)|
|best soap ever (5)||I’ll never change it (5)|
|it clumps (2)||gives me headaches (1)|
|smells like bug repellent (3)||pantene give it up (2)|
|fake fake fake do not buy (1)||seems to be harsh on my skin (2)|
|saved my skin (5)||not good quality (2)|
|another great release from saliva (5)||can t say enough good things about this cd (5)|
|a great collection (5)||definitive collection (5)|
|sound nice (3)||not his best nor his worst (4)|
|a complete massacre of an album (2)||some great songs but overall a dissapointment (3)|
|the very worst best of ever (1)||overall a pretty big disappointment (2)|
|what a boring moment (1)||overrated but still allright (3)|
|great cd (5)||a brilliant van halen debut album (5)|
4.7 Visualization of the Word Embeddings
The review embeddings of TransRev are learned so as to carry information about user ratings (Equation (2)) and information about the average word embedding of the words in the review text. As a consequence the learned word embeddings are correlated with ratings. To visualize the correlation between words and ratings we proceed as follows. First, we assign a score to each word that is computed by taking the average rating of the reviews that contain the word. Second, we compute a 2-dimensional representation of the words by applying t-SNE [20] to the 16-dimensional word embeddings learned by TransRev. Figure 4 depicts these 2-dimensional word embedding vectors learned for the Amazon Baby data set. The corresponding rating scores are indicated by the color of the dots.
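The first step, assigning each word the average rating of the reviews that contain it, can be sketched as below. The function name `word_scores` and the toy reviews are hypothetical, and the t-SNE projection is left out of the sketch.

```python
from collections import defaultdict


def word_scores(reviews):
    """Map each word to the average rating of the reviews that contain it.

    reviews: iterable of (tokens, rating) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, rating in reviews:
        for word in set(tokens):  # a review counts once per word
            totals[word] += rating
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


toy_reviews = [(["awesome", "stroller"], 5), (["horrible", "stroller"], 1)]
print(word_scores(toy_reviews))
```

These scores would then serve as the colors of the points in the 2-dimensional projection of the word embeddings.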
The clusters we discovered in Figure 4 are interpretable. They are meaningful with respect to the score, observing that the bottom cluster is mostly made up of words with negative connotations (e.g. horrible, useless, terrible), the middle one of neutral words (e.g. with, products, others) and the upper one of words with positive connotations (e.g. awesome, fantastic, excellent). This shows TransRev’s ability to learn word embeddings that also capture the sentiment of the review.
4.8 Suggesting Reviews to Users
One of the characteristics of TransRev is its ability to approximate the review representation at prediction time. This approximation is used to make a rating prediction, but it can also be used to propose a tentative review on which the user can elaborate. This is related to a number of approaches [32, 18, 24] on explainable recommendations. We think that this can lower the barrier to writing reviews. We compute the Euclidean distance between the approximated review embedding and all review embeddings from the training set. We then retrieve the review text with the most similar review embedding. We investigate the quality of the tentative reviews that TransRev retrieves for the Beauty and Digital Music data sets. The example reviews listed in Table 4 show that while the overall sentiment is correct in most cases, we can also observe the following shortcomings:
The function $f$ chosen in our work is invariant to word ordering and, therefore, cannot learn that bigrams such as “not good” have a negative meaning.
Despite matching the overall sentiment, the actual and retrieved review can refer to different aspects of the product (for example, “it clumps” and “gives me headaches”).
Reviews can be specific to a single product. A straightforward improvement could be achieved by retrieving only existing reviews for the specific product under consideration.
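The retrieval step described above can be sketched in Python. It is a minimal sketch under assumptions: training reviews are given as `(text, embedding)` pairs with plain lists as vectors, and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_training_review(e_user, e_item, training_reviews):
    """Return the text of the training review whose embedding is closest
    to the approximated review embedding (item embedding minus user embedding)."""
    h_hat = [i - u for u, i in zip(e_user, e_item)]
    best_text, _ = min(training_reviews, key=lambda tr: euclidean_distance(h_hat, tr[1]))
    return best_text


# Toy embeddings, chosen only for illustration.
training_reviews = [("love it", [1.0, 1.0]), ("not good quality", [-1.0, -1.0])]
print(closest_training_review([0.0, 0.0], [0.9, 0.8], training_reviews))
```

Restricting `training_reviews` to the reviews of the product under consideration would implement the product-specific variant mentioned in the last item of the list.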
We believe that more sophisticated sentence and paragraph representations might lead to better results in the review retrieval task. Moreover, a promising line of research has to do with learning representations for reviews that are aspect-specific. It would allow users to obtain retrieved reviews that mention specific aspects of products such as “ease of use” and “price.” We also think that similar ideas can be followed with data modalities other than review text.
5 Conclusion
TransRev is a novel approach for product recommendation combining methods and ideas from the areas of matrix factorization-based recommender systems, sentiment analysis, and knowledge graph completion. TransRev achieves state of the art performance on the data sets under consideration and outperforms existing methods in 15 of these data sets. TransRev is learned so as to be able to approximate, at test time, the embedding of the review as the difference of the embedding of the reviewed item and of the reviewing user. The approximated review embedding can be used with a sentiment analysis method to predict the review score.
References
[1] Allen, R.B.: User models: Theory, method, and practice. International Journal of Man-Machine Studies 32(5), 511–543 (1990)
[2] Almahairi, A., Kastner, K., Cho, K., Courville, A.C.: Learning distributed representations from reviews for collaborative filtering. In: RecSys. pp. 147–154 (2015)
[3] Bao, Y., Fang, H., Zhang, J.: TopicMF: Simultaneously exploiting ratings and reviews for recommendation. In: AAAI. pp. 2–8 (2014)
[4] Bermingham, A., Smeaton, A.F.: Classifying sentiment in microblogs: is brevity an advantage? In: CIKM. pp. 1833–1836 (2010)
[5] Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS. pp. 2787–2795 (2013)
[6] Breese, J.S., Heckerman, D., Kadie, C.M.: Empirical analysis of predictive algorithms for collaborative filtering. In: UAI. pp. 43–52 (1998)
[7] Brun, A., Hamad, A., Buffet, O., Boyer, A.: Towards preference relations in recommender systems. In: Preference Learning (PL 2010) ECML/PKDD 2010 Workshop (2010)
[8] Catherine, R., Cohen, W.W.: TransNets: Learning to transform for recommendation. In: RecSys. pp. 288–296 (2017)
[9] Diao, Q., Qiu, M., Wu, C., Smola, A.J., Jiang, J., Wang, C.: Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In: KDD. pp. 193–202 (2014)
[10] Dong, X., Yu, L., Wu, Z., Sun, Y., Yuan, L., Zhang, F.: A hybrid collaborative filtering model with deep structure for recommender systems. In: AAAI. pp. 1309–1315 (2017)
[11] García-Durán, A., Bordes, A., Usunier, N.: Composing relationships with translations. In: EMNLP. pp. 286–290. The Association for Computational Linguistics (2015)
[12] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS. JMLR Proceedings, vol. 9, pp. 249–256 (2010)
[13] Guo, G., Zhang, J., Yorke-Smith, N.: TrustSVD: Collaborative filtering with both the explicit and implicit influence of user trust and of item ratings. In: AAAI. pp. 123–129 (2015)
[14] Guu, K., Miller, J., Liang, P.: Traversing knowledge graphs in vector space. In: EMNLP. pp. 318–327. The Association for Computational Linguistics (2015)
[15] He, R., Kang, W., McAuley, J.: Translation-based recommendation. In: RecSys. pp. 161–169 (2017)
[16] Jakob, N., Weber, S.H., Müller, M.C., Gurevych, I.: Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations. In: 1st international CIKM workshop on Topic-sentiment analysis for mass opinion. pp. 57–64 (2009)
[17] Koren, Y., Bell, R.M., Volinsky, C.: Matrix factorization techniques for recommender systems. IEEE Computer 42(8), 30–37 (2009)
[18] Lawlor, A., Muhammad, K., Rafter, R., Smyth, B.: Opinionated explanations for recommendation systems. In: International Conference on Innovative Techniques and Applications of Artificial Intelligence. pp. 331–344. Springer (2015)
[19] Ling, G., Lyu, M.R., King, I.: Ratings meet reviews, a combined approach to recommend. In: RecSys. pp. 105–112 (2014)
[20] Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)
[21] McAuley, J.J., Leskovec, J.: Hidden factors and hidden topics: understanding rating dimensions with review text. In: RecSys. pp. 165–172 (2013)
[22] McAuley, J.J., Pandey, R., Leskovec, J.: Inferring networks of substitutable and complementary products. In: KDD. pp. 785–794 (2015)
[23] McAuley, J.J., Targett, C., Shi, Q., van den Hengel, A.: Image-based recommendations on styles and substitutes. In: SIGIR. pp. 43–52 (2015)
[24] Qureshi, M.A., Greene, D.: Lit@EVE: Explainable recommendation based on Wikipedia concept vectors. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 409–413. Springer (2017)
[25] Rendle, S., Freudenthaler, C., Schmidt-Thieme, L.: Factorizing personalized Markov chains for next-basket recommendation. In: WWW. pp. 811–820 (2010)
[26] Rennie, J.D.M., Srebro, N.: Fast maximum margin matrix factorization for collaborative prediction. In: ICML. ACM International Conference Proceeding Series, vol. 119, pp. 713–719 (2005)
[27] Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: WWW. pp. 285–295 (2001)
[28] Seo, S., Huang, J., Yang, H., Liu, Y.: Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In: RecSys. pp. 297–305 (2017)
[29] Seo, S., Huang, J., Yang, H., Liu, Y.: Representation learning of users and items for review rating prediction using attention-based convolutional neural network. In: 3rd International Workshop on Machine Learning Methods for Recommender Systems (MLRec) (2017)
[30] Wang, H., Wang, N., Yeung, D.: Collaborative deep learning for recommender systems. In: KDD. pp. 1235–1244 (2015)
[31] Wu, C., Beutel, A., Ahmed, A., Smola, A.J.: Explaining reviews and ratings with PACO: Poisson additive co-clustering. In: WWW (Companion Volume). pp. 127–128 (2016)
[32] Zhang, Y., Lai, G., Zhang, M., Zhang, Y., Liu, Y., Ma, S.: Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In: SIGIR. pp. 83–92 (2014)
[33] Zheng, L., Noroozi, V., Yu, P.S.: Joint deep modeling of users and items using reviews for recommendation. In: WSDM. pp. 425–434 (2017)