1. Introduction
With the explosive growth of information on the Internet, recommendation systems play an increasingly important role in online e-commerce and in applications across a variety of areas, including music streaming services such as Spotify (http://www.spotify.com) and Apple Music, movie rating sites such as IMDB (http://www.imdb.com), video streaming services such as Netflix and YouTube, job recommendation such as LinkedIn (http://www.linkedin.com), and product recommendation such as Amazon. Many recommendation methods are based on Collaborative Filtering (CF), which mainly makes use of historical ratings (Sarwar et al., 2001; Salakhutdinov and Mnih, 2007; Koren et al., 2009; Koren, 2008; Shi et al., 2010; Lee and Seung, 2001; Marlin, 2003). Recently, some approaches also consider text information in addition to the rating data (Wang and Blei, 2011; McAuley and Leskovec, 2013; Ling et al., 2014; Almahairi et al., 2015; Zheng et al., 2017; Ren et al., 2017). We observe that the text information in most recommendation tasks can generally be classified into two types: item specifications
(Wang and Blei, 2011; Wang et al., 2015, 2016) and user reviews (McAuley and Leskovec, 2013; Xu et al., 2014; Ling et al., 2014; Xu et al., 2015; Almahairi et al., 2015; Zheng et al., 2017; Ren et al., 2017). Item specifications are the text information describing the attributes or properties of the items. For example, in article recommendation such as CiteULike (http://www.citeulike.org), they refer to the titles and abstracts of papers. In product recommendation such as Amazon, they refer to product descriptions and technical specifications. The second type is user reviews, which are written by users to explain why they like or dislike an item based on their usage experience. Multi-faceted information can be extracted from reviews and used as user preferences or item features, which cannot otherwise be obtained from the overall ratings (Chen et al., 2015). Although both types of text data are useful for the recommendation task, they have inherent limitations. Concretely, the former cannot reflect users’ experience and preferences, and the latter is usually too long and suffers from noise.
Recently, some e-commerce sites such as Yelp (http://www.yelp.com) have launched a new interaction box called Tips on their mobile platforms. As shown in Figure 1, the left column is a review from the user “Monica H.”, and tips from several other users are shown in the right column. In the review text, Monica first gives a general introduction to the restaurant and then narrates her dining experience in detail. In the tips text, users express their experience and feelings plainly using short texts, such as “The risotto was excellent. Amazing service.”
They also provide suggestions to other people directly in a few words, such as “You have to make reservations much in advance.” In contrast to item specifications and user reviews, tips have several characteristics: (1) tips are typically single-topic nuggets of information, and are shorter than reviews, with a length of about 10 words on average; (2) tips can express user experience, feelings, and suggestions directly; (3) tips can give other people quick insights, saving the time of reading long reviews. In essence, writing tips and giving a numerical rating are two facets of a user’s product assessment action, both expressing the user experience and feelings. Jointly modeling these two facets is helpful for designing a better recommendation system.
Existing models only integrate text information such as item specifications (Wang and Blei, 2011; Wang et al., 2015, 2016) and user reviews (McAuley and Leskovec, 2013; Xu et al., 2014; Ling et al., 2014; Xu et al., 2015; Almahairi et al., 2015; Zheng et al., 2017; Ren et al., 2017) to enhance the performance of latent factor modeling and rating prediction. To the best of our knowledge, we are the first to consider tips for improving the recommendation quality. We aim to develop a model that is capable of conducting latent factor modeling and rating prediction and, more importantly, can generate tips based on the learnt latent factors. We do not simply extract existing sentences and regard them as tips. Instead, we investigate the task of automatically constructing a concise sentence as tips. This capability can be treated as simulating how users write tips to express their experience and feelings, just as if they had bought and consumed the item. Therefore, we name this task abstractive tips generation, where “abstractive” is a term from the research on text summarization (Bing et al., 2015).
Generating abstractive tips based only on user latent factors and item latent factors is a challenging task. Recently, gated recurrent neural networks such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) have demonstrated high capability in text generation related tasks (Bahdanau et al., 2015; Rush et al., 2015). Moreover, as shown by He et al. (2017) and Wang et al. (2015), neural network based models can help learn more effective latent factors when conducting rating prediction and improve the performance of collaborative filtering. We employ deep learning techniques for latent factor modeling, rating prediction, and abstractive tips generation. For abstractive tips generation, gated recurrent neural networks are employed to “translate” a user latent factor and an item latent factor into a concise sentence expressing user experience and feelings. For neural rating regression, a multi-layer perceptron network (Rosenblatt, 1961) is employed to project user latent factors and item latent factors into ratings. All the neural parameters in the gated recurrent neural networks and the multi-layer perceptron network, as well as the latent factors for users and items, are learnt by a multi-task learning approach in an end-to-end training paradigm.
The main contributions of our framework are summarized below:

We propose a deep learning based framework named NRT which can simultaneously predict precise ratings and generate abstractive tips with good linguistic quality, simulating user experience and feelings. All the neural parameters as well as the latent factors for users and items are learnt by a multi-task learning approach in an end-to-end training paradigm.

We are the first to explore using tips information to improve the recommendation quality. In essence, writing some tips and giving a numerical rating are two facets of a user’s product assessment action, expressing the user experience and feelings. Jointly modeling these two facets is helpful for designing a better recommendation system.

Experimental results on benchmark datasets show that our framework achieves better performance than the state-of-the-art models on both tasks of rating prediction and abstractive tips generation.
2. Related Works
Collaborative filtering (CF) has been studied for a long time and has achieved some success in recommendation systems (Ricci et al., 2011; Su and Khoshgoftaar, 2009). Latent Factor Models (LFM) based on Matrix Factorization (MF) (Koren et al., 2009) play an important role in rating prediction. Various MF algorithms have been proposed, such as Singular Value Decomposition (SVD) and SVD++ (Koren, 2008), Non-negative Matrix Factorization (NMF) (Lee and Seung, 2001), and Probabilistic Matrix Factorization (PMF) (Salakhutdinov and Mnih, 2007). These methods map users and items into a shared latent factor space, and use a vector of latent features as the representation for users and items respectively. The inner product of their latent factor vectors can then reflect the interactions between users and items.
The recommendation performance degrades significantly when the rating matrix is very sparse. Therefore, some works consider text information for improving the rating prediction. Both item specifications and user reviews have been investigated. In order to use the item specifications, CTR (Wang and Blei, 2011) integrates PMF (Salakhutdinov and Mnih, 2007) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) into a single framework and employs LDA to model the text. Collaborative Deep Learning (CDL) (Wang et al., 2015) employs a hierarchical Bayesian model which jointly performs deep representation learning for the specification text content and collaborative filtering for the rating matrix. For user review texts, some research works, such as HFT (McAuley and Leskovec, 2013), RMR (Ling et al., 2014), TriRank (He et al., 2015), and sCVR (Ren et al., 2017), integrate topic models in their frameworks to generate the latent factors for users and items incorporating review texts. Moreover, TriRank and sCVR explicitly claim that they can provide explanations for recommendations. However, one common limitation is that their explanations are simple extractions of words or phrases from the texts. In contrast, we aim at generating concise sentences representing tips, which express the feelings of users as they review an item.
Deep Learning (DL) techniques have achieved significant success in the fields of computer vision, speech recognition, and natural language processing (Goodfellow et al., 2016). In the field of recommendation systems, researchers have made some attempts at combining different neural network structures with collaborative filtering to improve the recommendation performance. Salakhutdinov et al. (2007) employ a class of two-layer Restricted Boltzmann Machines (RBM) with an efficient learning algorithm to model user interactions and perform collaborative filtering. Considering that the training procedure of autoencoders (Ng, 2011) is more straightforward, some research works employ autoencoders to tackle latent factor modeling and rating prediction (Sedhain et al., 2015; Wu et al., 2016a; Vincent et al., 2010). Recently, He et al. (2017) combine generalized matrix factorization and multi-layer perceptrons to find better latent structures from the user interactions for improving the performance of collaborative filtering. To model the temporal dynamic information in the user interactions, Wu et al. (2017) propose a recurrent recommender network which is able to predict future behavioral trajectories.
3. Framework Description
3.1. Overview
The goal of recommendation, similar to collaborative filtering, is to predict a rating given a user and an item. Additionally, in our proposed task, our model also generates abstractive tips in the form of a concise sentence. At the operational stage, only a user and an item are given. No review texts are given, and obviously no tips texts either.
At the training stage, the training data consists of users, items, tips texts, and review content. Table 1 depicts the notations and key concepts used in our paper. We denote the whole training corpus by $\mathcal{X} = \{\mathcal{U}, \mathcal{I}, \mathcal{R}, \mathcal{C}, \mathcal{S}\}$, where $\mathcal{U}$ and $\mathcal{I}$ are the sets of users and items respectively, $\mathcal{R}$ is the set of ratings, $\mathcal{C}$ is the set of review documents, and $\mathcal{S}$ is the set of tips sentences. As shown in Figure 2, our framework contains two major components: neural rating regression on the left and abstractive tips generation on the right. There are two crucial latent variables: user latent factors $\mathbf{U} \in \mathbb{R}^{k_u \times m}$ and item latent factors $\mathbf{V} \in \mathbb{R}^{k_v \times n}$, where $m$ is the number of users and $n$ is the number of items. $k_u$ and $k_v$ are the latent factor dimensions for users and items respectively. For neural rating regression, given the user latent factor $\mathbf{u}$ and the item latent factor $\mathbf{v}$, a multi-layer perceptron based regression model is employed to project $\mathbf{u}$ and $\mathbf{v}$ to a real value via several layers of non-linear transformations.
For abstractive tips generation, we design a sequence decoding model based on a gated recurrent neural network called Gated Recurrent Unit (GRU) (Cho et al., 2014) to “translate” the combination of a user latent factor and an item latent factor into a sequence of words, representing tips. Moreover, two kinds of context information generated from the user and item latent factors are also fed into the sequence decoder model. One is the hidden variable from the rating regression component, which is used as sentiment context information. The other is the hidden output of a generative model for review texts. At the operational or testing stage, we use a beam search algorithm (Koehn, 2004) for decoding and generating the best tips given a trained model. All the neural parameters and the latent factors for users, items, and words are learnt by a multi-task learning approach. The model can be trained efficiently in an end-to-end paradigm using the backpropagation algorithm (Rumelhart et al., 1988).
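Table 1. Notations and key concepts.
Symbol  Description
$\mathcal{X}$  training set
$\mathcal{V}$  vocabulary
$\mathcal{U}$  set of users
$\mathcal{I}$  set of items
$\mathcal{R}$  set of ratings
$\mathcal{C}$  set of reviews
$\mathcal{S}$  set of tips
$\mathcal{C}_{ctx}$  context for tips decoder
$\mathbf{U}$  user latent factors
$\mathbf{V}$  item latent factors
$\mathbf{E}$  word embeddings
$\mathbf{H}$  neural hidden states
$\mathbf{u}$  user latent factor
$\mathbf{v}$  item latent factor
$\mathbf{W}$  mapping matrix
$\mathbf{b}$  bias item
$\Theta$  set of neural parameters
$r_{u,i}$  rating of user $u$ to item $i$
$\sigma$  sigmoid function
$\varsigma$  softmax function
$\tanh$  hyperbolic tangent function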
3.2. Neural Rating Regression
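The aim of the neural rating regression component is to conduct representation learning for the user factor $\mathbf{u}$ and the item factor $\mathbf{v}$ mentioned above. In order to predict a rating, we need to design a model that can learn the function $f_r(\cdot)$ which projects $\mathbf{u}$ and $\mathbf{v}$ to a real-valued rating $\hat{r}$:
$$\hat{r} = f_r(\mathbf{u}, \mathbf{v}) \tag{1}$$
In most of the existing latent factor models, $f_r(\cdot)$ is represented by the inner product of $\mathbf{u}$ and $\mathbf{v}$, optionally with bias terms $b_u$ and $b_v$ added for the corresponding user and item, together with a global bias $b$:
$$\hat{r} = \mathbf{u}^{T}\mathbf{v} + b_u + b_v + b \tag{2}$$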
It is obvious that the rating is calculated by a linear combination of user latent factors, item latent factors, and bias. The learnt latent factors may not capture the complex structure implied in the user historical interactions. Recently, some research works on representation learning from different fields, such as computer vision (Krizhevsky et al., 2012; Goodfellow et al., 2014), natural language processing (Mikolov et al., 2013; Le and Mikolov, 2014), and knowledge base completion (Socher et al., 2013), demonstrate that nonlinear transformations will enhance the representation ability. Moreover, most latent factor models assume that users and items or even text information are in the same vector space and share the same latent factors. Actually, user, item, and text information are different kinds of objects with different characteristics. Modeling them in the same vector space would lead to limitations.
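To make the linear baseline concrete, the following is a minimal sketch of the biased inner-product predictor described above. The function name `lfm_predict` and the toy numbers are illustrative assumptions, not part of the original framework.

```python
def lfm_predict(u, v, b_u=0.0, b_v=0.0, b=0.0):
    """Linear latent factor prediction: inner product plus bias terms.

    u, v   : user and item latent factors (lists of equal length)
    b_u/b_v: user and item bias; b: global bias (all hypothetical names)
    """
    if len(u) != len(v):
        # The inner product forces users and items into the same space.
        raise ValueError("inner product needs factors of equal dimension")
    return sum(x * y for x, y in zip(u, v)) + b_u + b_v + b


# Toy example with two-dimensional factors.
rating = lfm_predict([0.5, 1.0], [2.0, 1.5], b_u=0.1, b_v=-0.2, b=3.0)
```

Note that this form requires the user and item factors to share one dimension, which is exactly the restriction the neural regression model relaxes.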
As shown in the left part of Figure 2, we let the user latent factors $\mathbf{U} \in \mathbb{R}^{k_u \times m}$ and the item latent factors $\mathbf{V} \in \mathbb{R}^{k_v \times n}$ lie in different vector spaces, where $k_u$ and $k_v$ are the latent factor dimensions for users and items respectively, and $m$ and $n$ are the numbers of users and items respectively. In order to model the relationship between users and items, one may consider using a neural tensor network (Socher et al., 2013) to describe the interactions between users and items, such as $\mathbf{u}^{T}\mathbf{W}\mathbf{v}$, where $\mathbf{W} \in \mathbb{R}^{k_u \times d \times k_v}$. However, our investigation shows that such a tensor network has too many parameters, which makes it difficult to handle the large-scale datasets commonly found in recommendation applications. Therefore, we employ a multi-layer perceptron network to model the interactions between users and items, and map user latent factors and item latent factors into real-valued ratings.
Specifically, we first map the latent factors to a shared hidden space:
$$\mathbf{h}^{r} = \sigma(\mathbf{W}^{r}_{uh}\mathbf{u} + \mathbf{W}^{r}_{vh}\mathbf{v} + \mathbf{b}^{r}_{h}) \tag{3}$$
where $\mathbf{W}^{r}_{uh} \in \mathbb{R}^{d \times k_u}$ and $\mathbf{W}^{r}_{vh} \in \mathbb{R}^{d \times k_v}$ are the mapping matrices for user latent factors and item latent factors respectively. $\mathbf{b}^{r}_{h} \in \mathbb{R}^{d}$ is the bias term. $d$ is the dimension of the hidden vector $\mathbf{h}^{r}$. The superscript $r$ refers to variables related to the rating prediction component. $\sigma(\cdot)$ is the sigmoid activation function:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{4}$$
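This non-linear transformation can improve the performance of the rating prediction. For better performance, we can add more layers of non-linear transformations to our model:
$$\mathbf{h}^{r}_{l} = \sigma(\mathbf{W}^{r}_{hh_l}\mathbf{h}^{r}_{l-1} + \mathbf{b}^{r}_{h_l}) \tag{5}$$
where $\mathbf{W}^{r}_{hh_l} \in \mathbb{R}^{d \times d}$ is the mapping matrix for the variables in the hidden layers and $l$ is the index of a hidden layer. Assume that $\mathbf{h}^{r}_{L}$ is the output of the last hidden layer. The output layer transforms $\mathbf{h}^{r}_{L}$ into a real-valued rating $\hat{r}$:
$$\hat{r} = \mathbf{W}^{r}_{hr}\mathbf{h}^{r}_{L} + b^{r} \tag{6}$$
where $\mathbf{W}^{r}_{hr} \in \mathbb{R}^{1 \times d}$ and $b^{r} \in \mathbb{R}$.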
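The forward pass of the rating regression component can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the parameter names (`W_uh`, `W_vh`, `b_h`, `hidden_layers`, `w_out`, `b_out`) are hypothetical, matrices are lists of rows, and no training is performed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def predict_rating(u, v, W_uh, W_vh, b_h, hidden_layers, w_out, b_out):
    """Map user and item factors of possibly different sizes to a rating."""
    # First layer: project both factors into one shared hidden space.
    h = [sigmoid(a + c + b)
         for a, c, b in zip(matvec(W_uh, u), matvec(W_vh, v), b_h)]
    # Optional additional non-linear layers, each given as (W, b).
    for W, b in hidden_layers:
        h = [sigmoid(z + bi) for z, bi in zip(matvec(W, h), b)]
    # Linear output layer producing a real-valued rating.
    return sum(w * hi for w, hi in zip(w_out, h)) + b_out


# Toy check: a 2-d user factor, a 3-d item factor, hidden size 2.
zero_rating = predict_rating(
    u=[1.0, 2.0], v=[0.5, 0.5, 0.5],
    W_uh=[[0.0, 0.0], [0.0, 0.0]],
    W_vh=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    b_h=[0.0, 0.0],
    hidden_layers=[([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])],
    w_out=[1.0, 1.0], b_out=3.0,
)
```

With all weights at zero every hidden unit equals 0.5, so the toy call simply returns the output bias plus the sum of the output weights times 0.5.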
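In order to optimize the latent factors $\mathbf{U}$ and $\mathbf{V}$, as well as all the neural parameters $\Theta$, we formulate rating prediction as a regression problem with the following loss function:
$$\mathcal{L}^{r} = \frac{1}{2|\mathcal{X}|}\sum_{u \in \mathcal{U},\, i \in \mathcal{I}} (\hat{r}_{u,i} - r_{u,i})^{2} \tag{7}$$
where $\mathcal{X}$ represents the training set. $r_{u,i}$ is the ground truth rating assigned by the user $u$ to the item $i$.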
3.3. Neural Abstractive Tips Generation
Generating abstractive tips based only on user latent factors and item latent factors is a challenging task. As mentioned above, abstractive tips generation is different from review content summarization and explainable topic words extraction. At the operational stage, the input consists only of a user and an item, without any text information. After obtaining the user latent factor and the item latent factor from the latent factor matrices, we need a strategy to “translate” these two latent vectors into a fluent sequence of words. Recently, gated recurrent neural networks such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) have demonstrated high capability in text generation related tasks (Bahdanau et al., 2015; Rush et al., 2015). Inspired by these works, and considering that GRU has comparable performance with fewer parameters and more efficient computation, we employ GRU as the basic model in our sequence modeling framework. The right part of Figure 2 depicts our tips generation model.
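The major idea of sequence modeling for tips generation can be expressed as follows:
$$p(s_t \mid s_1, s_2, \ldots, s_{t-1}, \mathcal{C}_{ctx}) = \varsigma(\mathbf{h}^{s}_{t}) \tag{8}$$
where $s_t$ is the $t$-th word of the tips $s$. $\mathcal{C}_{ctx}$ denotes the context information which will be described in the following sections. $\varsigma(\cdot)$ is the softmax function, defined as follows:
$$\varsigma(x^{(i)}) = \frac{e^{x^{(i)}}}{\sum_{k=1}^{K} e^{x^{(k)}}} \tag{9}$$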
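As a small illustration of the softmax function, the sketch below uses the common max-subtraction trick for numerical stability. That trick is an implementation choice assumed here, not something stated in the text; it leaves the result unchanged mathematically.

```python
import math


def softmax(x):
    """Turn a list of real scores into a probability distribution."""
    m = max(x)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


uniform = softmax([1.0, 1.0])
large = softmax([1000.0, 1000.0])  # would overflow without the shift
```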
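$\mathbf{h}^{s}_{t}$ is the sequence hidden state at time $t$, and it depends on the input at time $t$ and the previous hidden state $\mathbf{h}^{s}_{t-1}$:
$$\mathbf{h}^{s}_{t} = f(\mathbf{h}^{s}_{t-1}, s_t) \tag{10}$$
Here $f(\cdot)$ can be the vanilla RNN, LSTM, or GRU. In the case of GRU, the state updates are processed according to the following operations:
$$\begin{aligned} \mathbf{r}^{s}_{t} &= \sigma(\mathbf{W}^{s}_{sr}\mathbf{s}_t + \mathbf{W}^{s}_{hr}\mathbf{h}^{s}_{t-1} + \mathbf{b}^{s}_{r}) \\ \mathbf{z}^{s}_{t} &= \sigma(\mathbf{W}^{s}_{sz}\mathbf{s}_t + \mathbf{W}^{s}_{hz}\mathbf{h}^{s}_{t-1} + \mathbf{b}^{s}_{z}) \\ \mathbf{g}^{s}_{t} &= \tanh(\mathbf{W}^{s}_{sh}\mathbf{s}_t + \mathbf{W}^{s}_{hh}(\mathbf{r}^{s}_{t} \odot \mathbf{h}^{s}_{t-1}) + \mathbf{b}^{s}_{h}) \\ \mathbf{h}^{s}_{t} &= \mathbf{z}^{s}_{t} \odot \mathbf{h}^{s}_{t-1} + (1 - \mathbf{z}^{s}_{t}) \odot \mathbf{g}^{s}_{t} \end{aligned} \tag{11}$$
where $\mathbf{s}_t \in \mathbf{E}$ is the embedding vector for the word $s_t$ of the tips, and the vector is also learnt by our framework. $\mathbf{r}^{s}_{t}$ is the reset gate and $\mathbf{z}^{s}_{t}$ is the update gate. $\odot$ denotes element-wise multiplication. $\tanh$ is the hyperbolic tangent activation function.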
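A single GRU state update can be sketched as below. This follows one common GRU convention (reset gate applied to the previous state inside the candidate, update gate interpolating between the previous state and the candidate); the parameter dictionary keys such as `W_r` and `U_r` are hypothetical names chosen for this example.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _affine(W, x, U, h, b):
    """Row-wise W @ x + U @ h + b for list-of-rows matrices."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """One GRU update given word embedding x and previous state h_prev."""
    r = [_sigmoid(a) for a in _affine(p["W_r"], x, p["U_r"], h_prev, p["b_r"])]
    z = [_sigmoid(a) for a in _affine(p["W_z"], x, p["U_z"], h_prev, p["b_z"])]
    # The reset gate scales the previous state before the candidate is built.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    g = [math.tanh(a) for a in _affine(p["W_g"], x, p["U_g"], gated, p["b_g"])]
    # The update gate interpolates between old state and candidate.
    return [zi * hi + (1.0 - zi) * gi for zi, hi, gi in zip(z, h_prev, g)]


# Toy parameters: 1-d input, 2-d hidden state, all weights zero.
params = {k: [[0.0], [0.0]] for k in ("W_r", "W_z", "W_g")}
params.update({k: [[0.0, 0.0], [0.0, 0.0]] for k in ("U_r", "U_z", "U_g")})
params.update({k: [0.0, 0.0] for k in ("b_r", "b_z", "b_g")})
h1 = gru_step([1.0], [0.4, -0.2], params)
```

With zero weights both gates equal 0.5 and the candidate is zero, so the toy call halves the previous state.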
As shown in Figure 2, at the first decoding step the sequence model has no input information. Therefore, we utilize the context information $\mathcal{C}_{ctx}$ to initialize the hidden state $\mathbf{h}^{s}_{0}$. Context information is crucial in a sequence decoding framework and directly affects the performance of sequence generation. In the field of neural machine translation (Wu et al., 2016b), context information includes the encoding information of the source input and the decoding attention information from the source. In the field of neural summarization (Rush et al., 2015; Li et al., 2017), the context is the encoded document information. In our framework, the corresponding user $u$ and item $i$ are the input, from which we design two kinds of context information for tips generation: the predicted rating $\hat{r}_{u,i}$ and the generated hidden variable for the review text $\mathbf{h}^{c}_{L}$.
For the input, we simply look up the user latent factor $\mathbf{u}$ and the item latent factor $\mathbf{v}$ from the matrices $\mathbf{U}$ and $\mathbf{V}$:
$$\mathbf{u} = \mathbf{U}(:, u), \quad \mathbf{v} = \mathbf{V}(:, i) \tag{12}$$
For the context of rating information, we can employ the output of the rating regression component in Section 3.2. Specifically, after getting the predicted rating , for example, , we cast it into an integer , and add a step of vectorization. Then we get the vector representation of rating . If the rating range is , we will get the rating vector :
(13) 
The rating vector is used as the context information to control the sentiment of the generated tips.
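The rating vectorization step can be sketched as a one-hot encoding of the predicted rating. The exact casting rule and rating range are not recoverable from the text, so truncation toward zero and the default range of 0 to 5 below are assumptions made purely for illustration.

```python
def vectorize_rating(predicted, low=0, high=5):
    """One-hot vector for a real-valued predicted rating.

    low/high are assumed bounds of the integer rating scale.
    """
    level = int(predicted)  # truncation is an assumption; rounding also plausible
    level = max(low, min(high, level))  # clamp predictions outside the range
    vec = [0.0] * (high - low + 1)
    vec[level - low] = 1.0
    return vec


rating_vec = vectorize_rating(4.65)
```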
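Another kind of context information comes from review texts. Note that review texts cannot be used as the input directly, because no review information is available at the testing stage. We only make use of reviews to enhance the representation ability of the latent vectors $\mathbf{u}$ and $\mathbf{v}$. We develop a standard generative model for review texts based on a multi-layer perceptron. For the review content $c_{u,i}$ written by the user $u$ for the item $i$, the generative process is defined as follows. We first map the user latent factor $\mathbf{u}$ and the item latent factor $\mathbf{v}$ into a hidden space:
$$\mathbf{h}^{c} = \sigma(\mathbf{W}^{c}_{uh}\mathbf{u} + \mathbf{W}^{c}_{vh}\mathbf{v} + \mathbf{b}^{c}_{h}) \tag{14}$$
We can also add more layers of non-linear transformations to the generative hidden layers. Assume that $\mathbf{h}^{c}_{L}$ is the output of the last hidden layer. We add the final generative layer to map $\mathbf{h}^{c}_{L}$ into a $|\mathcal{V}|$-sized vector $\hat{\mathbf{c}}$, where $\mathcal{V}$ is the vocabulary of words in the reviews and the tips:
$$\hat{\mathbf{c}} = \varsigma(\mathbf{W}^{c}_{hc}\mathbf{h}^{c}_{L} + \mathbf{b}^{c}) \tag{15}$$
where $\mathbf{W}^{c}_{hc} \in \mathbb{R}^{|\mathcal{V}| \times d}$ and $\mathbf{b}^{c} \in \mathbb{R}^{|\mathcal{V}|}$. $\varsigma(\cdot)$ is the softmax function defined in Equation 9. In fact, we can regard $\hat{\mathbf{c}}$ as a multinomial distribution defined on $\mathcal{V}$. Therefore, we can draw some words from $\hat{\mathbf{c}}$ and generate the content of the review $c_{u,i}$. We let $\mathbf{c}$ be the ground truth of $c_{u,i}$, where $c^{(a)}$ is the term frequency of the word $a$ in $c_{u,i}$. We employ the likelihood to evaluate the performance of this generative process. For convenience, we use the Negative Log-Likelihood (NLL) as the loss function:
$$\mathcal{L}^{c} = -\sum_{a=1}^{|\mathcal{V}|} c^{(a)} \log \hat{c}^{(a)} \tag{16}$$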
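The review loss can be illustrated with a short sketch that weights each word's log-probability by its term frequency. The function name `review_nll` and the toy distribution are assumptions for this example; in the model the distribution would come from the softmax output of the generative layer.

```python
import math
from collections import Counter


def review_nll(word_probs, review_tokens):
    """Negative log-likelihood of a bag-of-words review.

    word_probs    : dict mapping each vocabulary word to its probability
    review_tokens : list of words in the ground-truth review
    """
    term_freq = Counter(review_tokens)
    # Words absent from the review have frequency zero and drop out.
    return -sum(count * math.log(word_probs[word])
                for word, count in term_freq.items())


toy_probs = {"good": 0.5, "food": 0.25, "bad": 0.25}
nll = review_nll(toy_probs, ["good", "good", "food"])
```

Here the loss is -(2 log 0.5 + log 0.25), which equals 4 log 2.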
One characteristic of our model design is that both the ratings and the review texts are generated from the same user latent factors $\mathbf{U}$ and item latent factors $\mathbf{V}$, i.e., $\mathbf{U}$ and $\mathbf{V}$ are shared by the subtasks of rating prediction and review text generation. Thus, in the training stage, both $\mathbf{U}$ and $\mathbf{V}$ receive feedback from all the subtasks, which improves the representation ability of the latent factors.
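After obtaining all the context information $\mathcal{C}_{ctx} = \{\mathbf{u}, \mathbf{v}, \hat{\mathbf{r}}, \mathbf{h}^{c}_{L}\}$, we integrate it into the initial decoding hidden state $\mathbf{h}^{s}_{0}$ using a non-linear transformation:
$$\mathbf{h}^{s}_{0} = \tanh(\mathbf{W}^{s}_{uh}\mathbf{u} + \mathbf{W}^{s}_{vh}\mathbf{v} + \mathbf{W}^{s}_{rh}\hat{\mathbf{r}} + \mathbf{W}^{s}_{ch}\mathbf{h}^{c}_{L} + \mathbf{b}^{s}_{h}) \tag{17}$$
where $\mathbf{u}$ is the user latent factor, $\mathbf{v}$ is the item latent factor, $\hat{\mathbf{r}}$ is the vectorization of the predicted rating $\hat{r}$, and $\mathbf{h}^{c}_{L}$ is the generated hidden variable from the review text. Then the GRU can conduct the sequence decoding process. After getting all the sequence hidden states, we feed them to the final output layer to predict the word sequence in the tips:
$$\hat{\mathbf{s}}_{t+1} = \varsigma(\mathbf{W}^{s}_{hs}\mathbf{h}^{s}_{t} + \mathbf{b}^{s}) \tag{18}$$
where $\mathbf{W}^{s}_{hs} \in \mathbb{R}^{|\mathcal{V}| \times d}$ and $\mathbf{b}^{s} \in \mathbb{R}^{|\mathcal{V}|}$. $\varsigma(\cdot)$ is the softmax function defined in Equation 9. Then the word with the largest probability is the decoding result for step $t+1$:
$$w^{*}_{t+1} = \arg\max_{w_i \in \mathcal{V}} \hat{s}^{(w_i)}_{t+1} \tag{19}$$
At the training stage, we also use NLL as the loss function, where $I_w$ is the vocabulary index of the word $w$:
$$\mathcal{L}^{s} = -\sum_{w \in Tips} \log \hat{s}^{(I_w)} \tag{20}$$
At the testing stage, given a trained model, we employ the beam search algorithm to find the best sequence $s^{*}$ having the maximum log-likelihood:
$$s^{*} = \arg\max_{s \in \mathcal{S}} \sum_{w \in s} \log \hat{s}^{(I_w)} \tag{21}$$
The details of the beam search algorithm are shown in Algorithm 1.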
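Since Algorithm 1 is not reproduced here, the following is a generic beam search sketch rather than the authors' exact procedure. It assumes a hypothetical `step_fn(state, words)` interface that returns a dictionary of next-word log-probabilities and the next decoder state; in the real model this would wrap the GRU and the output softmax.

```python
import math


def beam_search(step_fn, start_state, eos, beam_size=4, max_len=20):
    """Return the word sequence with the highest summed log-probability."""
    beams = [(0.0, [], start_state)]  # (score, words so far, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words, state in beams:
            log_probs, next_state = step_fn(state, words)
            for word, lp in log_probs.items():
                candidates.append((score + lp, words + [word], next_state))
        # Keep only the best beam_size expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == eos else beams).append(cand)
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    best = max(finished, key=lambda c: c[0])
    return best[1], best[0]


# Toy language model in which greedy decoding is suboptimal.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "<eos>": 0.1},
}


def toy_step(state, words):
    probs = TABLE.get(tuple(words), {"<eos>": 1.0})
    return {w: math.log(p) for w, p in probs.items()}, state


best_words, best_score = beam_search(toy_step, None, "<eos>", beam_size=2)
greedy_words, _ = beam_search(toy_step, None, "<eos>", beam_size=1)
```

In the toy model, a beam of width 2 finds the sequence starting with "b" (probability 0.36), while width 1 (greedy decoding) commits to "a" and ends with probability 0.3.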
3.4. Multitask Learning
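We integrate all the subtasks of rating prediction and abstractive tips generation into a unified multi-task learning framework whose objective function is:
$$\mathcal{J} = \min_{\mathbf{U}, \mathbf{V}, \mathbf{E}, \Theta} \left( \lambda_r \mathcal{L}^{r} + \lambda_c \mathcal{L}^{c} + \lambda_s \mathcal{L}^{s} + \lambda_n \left( \|\mathbf{U}\|_2^2 + \|\mathbf{V}\|_2^2 + \|\Theta\|_2^2 \right) \right) \tag{22}$$
where $\mathcal{L}^{r}$ is the rating regression loss from Equation 7, $\mathcal{L}^{c}$ is the review text generation loss from Equation 16, and $\mathcal{L}^{s}$ is the tips generation loss from Equation 20. $\Theta$ is the set of neural parameters. $\lambda_r$, $\lambda_c$, $\lambda_s$, and $\lambda_n$ are the weight proportions of each term. The whole framework can be efficiently trained using backpropagation in an end-to-end paradigm.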
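The combination of the subtask losses can be sketched as a weighted sum. The fourth term is assumed here to be an L2 penalty on the parameters, and the weight values are placeholders; neither is confirmed by the surviving text.

```python
def multitask_objective(loss_rating, loss_review, loss_tips, params,
                        weights=(1.0, 1.0, 1.0, 1e-4)):
    """Weighted sum of the three task losses plus an assumed L2 penalty.

    params  : flat list of parameter values (latent factors and weights)
    weights : (lambda_r, lambda_c, lambda_s, lambda_n), placeholder values
    """
    lam_r, lam_c, lam_s, lam_n = weights
    l2_penalty = sum(p * p for p in params)
    return (lam_r * loss_rating + lam_c * loss_review
            + lam_s * loss_tips + lam_n * l2_penalty)


objective = multitask_objective(1.0, 2.0, 3.0, [1.0, 2.0],
                                weights=(1.0, 1.0, 1.0, 0.1))
```

Because all three losses depend on the same user and item factors, minimizing this single scalar lets every subtask send gradients to the shared representations.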
4. Experimental Setup
4.1. Research Questions
We list the research questions we want to investigate in this paper:

RQ1: What is the performance of NRT in rating prediction tasks? Does it outperform the state-of-the-art models? (See Section 5.1.)

RQ2: What is the performance of NRT in abstractive tips generation? Can the generated tips express user experience and feelings? (See Section 5.2.)

RQ3: What is the relationship between predicted ratings and the sentiment of generated tips? (See Section 5.3.)
We conduct extensive experiments to investigate the above research questions.
4.2. Datasets
In our experiments, we use four standard benchmark datasets from different domains to evaluate our model. The ratings of these datasets are integers in the range of . There are three datasets from Amazon 5-core (http://jmcauley.ucsd.edu/data/amazon): Books, Electronics, and Movies & TV. “Books” is the largest dataset among all the domains. It contains 603,668 users, 367,982 items, and 8,887,781 reviews. We regard the field “summary” as tips, so the number of tips texts is the same as the number of reviews.
Another dataset is from Yelp Challenge 2016 (https://www.yelp.com/dataset_challenge). It is also a large-scale dataset consisting of restaurant reviews and tips. The number of users is 684,295, which is the largest among all the datasets. Therefore this dataset is also the sparsest one. Tips are included in the dataset. For samples without tips, the first sentence of the review text is extracted and regarded as tips.
We filter out the words with low term frequency in the tips and review texts, and build a vocabulary for each dataset. We show the statistics of our datasets in Table 2.
Table 2. Overview of the datasets.
Statistic  Books  Electronics  Movies&TV  Yelp2016
users  603,668  192,403  123,960  684,295
items  367,982  63,001  50,052  85,533
reviews  8,887,781  1,684,779  1,697,533  2,346,227
$|\mathcal{V}|$ (vocabulary size)  258,190  70,294  119,530  111,102
4.3. Evaluation Metrics
For the evaluation of rating prediction, we employ two metrics: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Both of them are widely used for rating prediction in recommender systems. Given a predicted rating $\hat{r}_{u,i}$ and a ground-truth rating $r_{u,i}$ from the user $u$ for the item $i$, the RMSE is calculated as:
$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{u,i}(\hat{r}_{u,i} - r_{u,i})^{2}} \tag{23}$$
where $N$ indicates the number of ratings between users and items. Similarly, MAE is calculated as follows:
$$\text{MAE} = \frac{1}{N}\sum_{u,i}\left|\hat{r}_{u,i} - r_{u,i}\right| \tag{24}$$
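Both rating metrics are straightforward to compute; a minimal sketch over parallel lists of predicted and ground-truth ratings (function names chosen for this example) is:

```python
import math


def rmse(predicted, truth):
    """Root mean square error over paired ratings."""
    n = len(truth)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, truth)) / n)


def mae(predicted, truth):
    """Mean absolute error over paired ratings."""
    n = len(truth)
    return sum(abs(p - r) for p, r in zip(predicted, truth)) / n


toy_rmse = rmse([1.0, 5.0], [1.0, 2.0])
toy_mae = mae([1.0, 5.0], [1.0, 2.0])
```

In the toy call one prediction is exact and one is off by 3, so MAE is 1.5 while RMSE is the square root of 4.5, showing that RMSE penalizes large errors more heavily.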
For the evaluation of abstractive tips generation, the ground truth $s_h$ is the tips written by the user for the item. We use ROUGE (Lin, 2004) as our evaluation metric with standard options (ROUGE-1.5.5.pl -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0). It is a classical evaluation metric in the field of text summarization (Lin, 2004; Bing et al., 2015). It counts the number of overlapping units between the generated tips and the ground truth written by users. Assuming that $s$ is the generated tips, $g_n$ is an n-gram, $C(g_n)$ is the number of n-grams in $\tilde{s}$ ($s_h$ or $s$), and $C_m(g_n)$ is the number of n-grams co-occurring in $s$ and $s_h$, then the ROUGE-N score for $s$ is defined as follows:
$$\text{ROUGE-N}(s) = \frac{\sum_{g_n \in s_h} C_m(g_n)}{\sum_{g_n \in \tilde{s}} C(g_n)} \tag{25}$$
When $\tilde{s} = s_h$, we get the ROUGE recall, and when $\tilde{s} = s$, we get the ROUGE precision. We use Recall, Precision, and F-measure of ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L), and ROUGE-SU4 (R-SU4) to evaluate the quality of the generated tips.
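The n-gram overlap idea behind ROUGE-N can be sketched as follows. This is a simplified illustration, not a replacement for the official ROUGE-1.5.5 script: it performs no stemming, stopword handling, or bootstrap resampling, and it reports a balanced F-measure.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(generated, reference, n=1):
    """Return (recall, precision, f1) of n-gram overlap."""
    gen, ref = ngram_counts(generated, n), ngram_counts(reference, n)
    # Clipped co-occurrence count: each n-gram matches at most as often
    # as it appears in both texts.
    overlap = sum((gen & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(gen.values()), 1)
    if overlap == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * precision * recall / (precision + recall)


gen_tip = "the food was great".split()
ref_tip = "the food was really great here".split()
r1 = rouge_n(gen_tip, ref_tip, n=1)
r2 = rouge_n(gen_tip, ref_tip, n=2)
```

For the toy pair, all four generated unigrams appear in the six-word reference, so unigram precision is 1 and recall is 4/6; only two of the bigrams match.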
4.4. Comparative Methods
To evaluate the performance of rating prediction, we compare our model with the following methods:

RMR: Ratings Meet Reviews (Ling et al., 2014). It utilizes a topic modeling technique to model the review texts and achieves significant improvements compared with other strong topic modeling based methods.

CTR: Collaborative Topic Regression (Wang and Blei, 2011). It is a popular method for scientific article recommendation that solves a one-class collaborative filtering problem. Note that CTR uses both ratings and item specifications.

NMF: Nonnegative Matrix Factorization (Lee and Seung, 2001). It only uses the rating matrix as the input.

PMF: Probabilistic Matrix Factorization (Salakhutdinov and Mnih, 2007). Gaussian distributions are introduced to model the latent factors for users and items.

LRMF: Learning to Rank with Matrix Factorization (Shi et al., 2010). It combines a list-wise learning-to-rank algorithm with matrix factorization to improve recommendation.

SVD++: It extends Singular Value Decomposition by considering implicit feedback information for latent factor modeling (Koren, 2008).

URP: User Rating Profile modeling (Marlin, 2003). Topic models are employed to model the user preference from a generative perspective. It still only uses the rating matrix as input.
For abstractive tips generation, we find that no existing works can generate abstractive tips purely based on latent factors of users and items. In order to evaluate the performance and conduct comparison with some baselines, we refine some existing methods to make them capable of extracting sentences for tips generation as follows.
LexRank (Erkan and Radev, 2004) is a classical method in the field of text summarization. We add a preprocessing procedure to prepare the input texts for LexRank, which consists of the following steps: (1) Retrieval: For the user $u$, we first retrieve all her reviews $C_u$ from the training set. For the item $i$, we use the same method to get $C_i$. (2) Filtering: Assuming that the ground truth rating for $u$ and $i$ is $r_{u,i}$, we remove all the reviews from $C_u$ and $C_i$ whose ratings are not equal to $r_{u,i}$. The reviews whose words only appear in one set are also removed. (3) Tips extraction: We first merge $C_u$ and $C_i$ to get $C_{u,i}$; the problem can then be regarded as a multi-document summarization problem. LexRank can extract a sentence from $C_{u,i}$ as the final tips. Note that this gives the method an advantage, since the ground truth ratings are used.
CTR contains a topic model component and it can generate topics for items. So the topic related variables are employed to extract tips: (1) We first get the latent factor $\theta_i$ for item $i$, and draw the topic $z$ with the largest probability from $\theta_i$. Then from $\phi_z$, which is a multinomial distribution of $z$ on $\mathcal{V}$, we select the top words with the largest probability. (2) The most similar sentence from $C_{u,i}$ is extracted as the tips. This baseline is named CTR. Another baseline method RMR is designed in the same way.
Finally, we list all the methods and baselines in Table 3.
Acronym  Gloss  Reference 
NRT  Neural rating and tips generation  Section 3 
Rating prediction  
RMR  Ratings meet reviews model  (Ling et al., 2014) 
CTR  Collaborative topic regression model  (Wang and Blei, 2011) 
NMF  Nonnegative matrix factorization  (Lee and Seung, 2001) 
PMF  Probabilistic matrix factorization  (Salakhutdinov and Mnih, 2007) 
LRMF  Listwise learning to rank for item ranking  (Shi et al., 2010) 
SVD++  Factorization meets the neighborhood  (Koren, 2008) 
URP  User rating profile modeling using LDA  (Marlin, 2003) 
Tips generation  
LexRank  PageRank for summarization  (Erkan and Radev, 2004) 
CTR  CTR for tips topic extraction  (Wang and Blei, 2011) 
RMR  RMR for tips topic extraction  (Ling et al., 2014) 
4.5. Experimental Settings
Each dataset is divided into three subsets: , , and , for training, validation, and testing, respectively. All the parameters of our model are tuned with the validation set. After the tuning process, we set the number of latent factors for LRMF, NMF, PMF, and SVD++. We set the number of topics for the methods using topic models. In our model NRT, we set for user latent factors, item latent factors, and word latent factors. The dimension of the hidden size is . The number of layers for the rating regression model is , and for the tips generation model is . We set the beam size , and the maximum length . For the optimization objective, we let the weight parameters , and . The batch size for mini-batch training is .
All the neural matrix parameters in hidden layers and RNN layers are initialized from a uniform distribution between . Adadelta (Zeiler, 2012) is used for gradient based optimization. Our framework is implemented with Theano (Theano Development Team, 2016) on a single Tesla K80 GPU.
5. Results and Discussions
5.1. Rating Prediction (RQ1)
Table 4. MAE and RMSE values for rating prediction.
Method  Books MAE  Books RMSE  Electronics MAE  Electronics RMSE  Movies MAE  Movies RMSE  Yelp2016 MAE  Yelp2016 RMSE
LRMF  1.939  2.153  2.005  2.203  1.977  2.189  1.809  2.038
PMF  0.882  1.219  1.220  1.612  0.927  1.290  1.320  1.752
NMF  0.731  1.035  0.904  1.297  0.794  1.135  1.062  1.454
SVD++  0.686  0.967  0.847  1.194  0.745  1.049  1.020  1.349
URP  0.704  0.945  0.860  1.126  0.764  1.006  1.030  1.286
CTR  0.736  0.961  0.903  1.154  0.854  1.069  1.174  1.392
RMR  0.681  0.933  0.822  1.123  0.741  1.005  0.994  1.286
NRT  0.667*  0.927*  0.806*  1.107*  0.702*  0.985*  0.985*  1.277*

*Statistical significance tests show that our method is better than RMR (Ling et al., 2014).
The rating prediction results of our framework NRT and the comparative models on all datasets are given in Table 4. It shows that our model consistently outperforms all comparative methods under both MAE and RMSE metrics on all datasets. From the comparison, we notice that the topic modeling based methods CTR and RMR are much better than LRMF, NMF, PMF, and SVD++. The reason is that CTR and RMR consider text information such as item specifications and user reviews to improve the representation quality of latent factors, while the traditional CF-based models (e.g. LRMF, NMF, PMF, and SVD++) only consider the rating matrix as the input. Statistical significance of differences between the performance of NRT and RMR, the best comparison method, is tested using a two-tailed paired t-test. The result shows that NRT is significantly better than RMR.
Apart from jointly learning the tips decoder, we did not apply any sophisticated linguistic operations on the texts of reviews and tips. Jointly modeling the tips information is already very helpful for recommendation performance. In fact, a tip and its corresponding rating are two facets of a user's assessment of an item, namely, the qualitative facet and the quantitative facet. Our framework NRT captures both facets with its multi-task learning model. Therefore, the learnt latent factors are more effective.
5.2. Abstractive Tips Generation (RQ2)
Our NRT model can not only solve the rating prediction problem, but also generate abstractive tips that simulate how users express their experience and feelings. The evaluation results of tips generation for our model and the comparative methods are given in Tables 5 to 8. In order to capture more details, we report Recall, Precision, and F-measure (in percentage) of ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU4. Our model achieves the best Precision and F1-measure on all four datasets. On the Movies&TV dataset, NRT also achieves the best Recall for all ROUGE metrics.
Table 5. ROUGE evaluation on dataset Books.
Methods  ROUGE-1                ROUGE-2                ROUGE-L                ROUGE-SU4
         R      P      F1       R      P      F1       R      P      F1       R      P      F1
LexRank  12.94  12.02  12.18    2.26   2.29   2.23     11.72  10.89  11.02    4.13   4.15   4.02
RMR      13.80  11.69  12.43    1.79   1.57   1.64     12.54  10.55  11.25    4.49   3.54   3.80
CTR      14.06  11.85  12.62    2.03   1.80   1.87     12.68  10.64  11.35    4.71   3.71   3.99
NRT      10.30  19.28  12.67    1.91   3.76   2.36     9.71   17.92  11.88    3.24   8.03   4.13
Table 6. ROUGE evaluation on dataset Electronics.
Methods  ROUGE-1                ROUGE-2                ROUGE-L                ROUGE-SU4
         R      P      F1       R      P      F1       R      P      F1       R      P      F1
LexRank  13.42  13.48  12.08    1.90   2.04   1.83     11.72  11.48  10.44    4.57   4.51   3.88
RMR      15.68  11.32  12.30    2.52   2.04   2.15     13.37  9.61   10.45    5.41   3.72   3.97
CTR      15.81  11.37  12.38    2.49   1.92   2.05     13.45  9.62   10.50    5.39   3.63   3.89
NRT      13.08  17.72  13.95    2.59   3.36   2.72     11.93  16.01  12.67    4.51   6.69   4.68
Table 7. ROUGE evaluation on dataset Movies&TV.
Methods  ROUGE-1                ROUGE-2                ROUGE-L                ROUGE-SU4
         R      P      F1       R      P      F1       R      P      F1       R      P      F1
LexRank  13.62  14.11  12.37    1.92   2.09   1.81     11.69  11.74  10.47    4.47   4.53   3.75
RMR      14.64  10.26  11.33    1.78   1.36   1.46     12.62  8.72   9.67     4.63   3.00   3.28
CTR      15.13  10.37  11.57    1.90   1.42   1.54     13.02  8.77   9.85     4.88   3.03   3.36
NRT      15.17  20.22  16.20    4.25   5.72   4.56     13.82  18.36  14.73    6.04   8.76   6.33
Table 8. ROUGE evaluation on dataset Yelp2016.
Methods  ROUGE-1                ROUGE-2                ROUGE-L                ROUGE-SU4
         R      P      F1       R      P      F1       R      P      F1       R      P      F1
LexRank  11.32  11.16  11.04    1.32   1.34   1.31     10.33  10.16  10.06    3.41   3.38   3.26
RMR      11.17  10.25  10.54    2.25   2.16   2.19     10.22  9.39   9.65     3.88   3.66   3.72
CTR      10.74  9.95   10.19    2.21   2.14   2.15     9.91   9.19   9.41     3.96   3.64   3.70
NRT      9.39   17.75  11.64    1.83   3.39   2.22     8.70   16.27  10.74    3.01   7.06   3.78
For most of the datasets, our NRT model does not outperform the baselines on Recall. There are several reasons: (1) The ground truth tips used in the training set are very short, only about 10 words long on average. Naturally, a model trained on this dataset cannot generate long sentences. (2) The mechanism of the typical beam search algorithm makes the model favor short sentences, because each additional word adds a negative log-probability to the score of a candidate. (3) The comparison models are extraction-based approaches, and these models tend to extract long sentences, although we impose a length restriction (i.e., 20 words) on them.
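The precision and recall trade-off discussed above can be illustrated with a simplified n-gram overlap computation. This is only a sketch: the official ROUGE package applies further options (such as stemming) that are not reproduced here, whitespace tokenization is an assumption, and the function name `rouge_n` and the example sentences are hypothetical.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) based on clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1

# Hypothetical example: one reference, a short candidate and a long candidate.
reference = "the risotto was excellent and the service was amazing"
short_tip = "the risotto was excellent"
long_tip = ("we came here on a friday and the risotto was excellent "
            "but the wait was long and the room was loud")

print("short:", rouge_n(short_tip, reference))
print("long: ", rouge_n(long_tip, reference))
```

In this toy case the short candidate has perfect precision but covers fewer than half of the reference words, while the long candidate covers more of the reference at a much lower precision. This mirrors the pattern in which short generated tips score high on Precision and long extracted sentences score high on Recall.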
We investigate the performance of different beam sizes used in the beam search algorithm. The relationship between ROUGE and the beam size on the two validation sets of Electronics and Movies&TV is shown in Figure 3. We test and find that when our model can achieve the best performance of tips generation.
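As a rough illustration of why the beam size matters, the sketch below runs beam search over a tiny hand-written next-word table. Everything in it is hypothetical: the table `NEXT`, the probabilities, and the function `beam_search` are stand-ins for illustration, whereas the decoder in NRT is a GRU conditioned on user and item latent factors.

```python
import math

# Hypothetical next-word distribution, keyed by the previous word only.
NEXT = {
    "<s>": {"great": 0.5, "not": 0.4, "</s>": 0.1},
    "great": {"product": 0.6, "</s>": 0.4},
    "not": {"bad": 0.9, "</s>": 0.1},
    "product": {"</s>": 1.0},
    "bad": {"</s>": 1.0},
}

def beam_search(beam_size, max_len=5):
    """Return finished hypotheses as (tokens, log_prob), best first."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next word.
        candidates = []
        for tokens, score in beams:
            for word, prob in NEXT[tokens[-1]].items():
                candidates.append((tokens + [word], score + math.log(prob)))
        # Keep only the top `beam_size` expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

for size in (1, 2):
    tokens, score = beam_search(size)[0]
    print(size, " ".join(tokens[1:-1]), round(math.exp(score), 2))
```

With a beam size of 1 the search commits to the locally best first word and ends with a sequence of probability 0.3, while a beam size of 2 keeps the second-best first word alive and finds a sequence of probability 0.36.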
Inspired by Wu et al. (2016b), we make use of Length-Normalization (LN) to adjust the log-probability in the beam search algorithm so that the beam search also considers long sentences:
(26) 
where is the decoded sequence, , and . We conduct several experiments to verify the effectiveness of LN. The comparison results are shown in Table 9, where F1-measures of ROUGE evaluation metrics are reported. Our model NRT with LN is consistently better than the one without LN on all four metrics and on both datasets.
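Table 9. ROUGE F1-measures (%) of NRT with and without length normalization (LN).
Dataset      Method       R-1    R-2    R-L    R-SU4
Electronics  NRT w/o LN   13.36  2.65   12.34  4.56
Electronics  NRT          13.72  2.68   12.57  4.66
Movies&TV    NRT w/o LN   14.86  3.72   13.76  5.46
Movies&TV    NRT          15.21  4.00   13.90  5.71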
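The effect of length normalization on hypothesis ranking can be sketched as follows. The code uses the length penalty of Wu et al. (2016b), lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha, and divides the log-probability of a hypothesis by it. Whether the formulation used in NRT matches this form exactly is an assumption, and the value alpha = 0.6, the two hypotheses, and their log-probabilities are hypothetical values chosen for illustration.

```python
def length_penalty(length, alpha):
    """Length penalty in the form given by Wu et al. (2016b)."""
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def normalized_score(log_prob, length, alpha):
    """Log-probability divided by the length penalty; alpha = 0 disables LN."""
    return log_prob / length_penalty(length, alpha)

def best_hypothesis(hypotheses, alpha):
    """Pick the hypothesis text with the highest length-normalized score."""
    return max(
        hypotheses,
        key=lambda h: normalized_score(h[1], len(h[0].split()), alpha),
    )[0]

# Hypothetical finished hypotheses: (text, total log-probability).
hypotheses = [
    ("great product", -3.0),
    ("this is a great product for a great price", -4.0),
]

print("without LN:", best_hypothesis(hypotheses, alpha=0.0))
print("with LN:   ", best_hypothesis(hypotheses, alpha=0.6))
```

Without normalization the shorter hypothesis wins because it accumulates fewer negative log-probability terms. With normalization the longer hypothesis is preferred, since its score is divided by a larger penalty.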
5.3. Case Analysis (RQ3)
For the purpose of analyzing the linguistic quality and the sentiment correlation between the predicted ratings and the generated tips, we selected some real cases from different domains. The results are listed in Table 10. Although our model generates tips in an abstractive way, the linguistic quality of the tips is quite good.
For the sentiment correlation analysis, we also choose some generated tips with negative sentiment. Take the generated tips “Not as good as i expected.” as an example: our model predicts a rating of 2.25, which is clearly consistent with the sentiment of the tips. The ground truth tips of this example is “Jack of all trades master of none.”, which also conveys a negative sentiment. One interesting observation is that its ground truth rating is the full mark of 5, which, we guess, may have been clicked by mistake (a “fat finger”). Nevertheless, our model generates a consistent sentiment between this case’s rating and tips. Another generated tips, “What a waste of time and money.”, with a negative predicted rating of 1.46, also demonstrates this property.
There are also some bad cases. For example, the predicted rating of the generated tips “Not bad for the price.” is 4.34, which indicates a positive polarity, but the sentiment of the generated tips is neutral, consistent with the ground truth. Generally speaking, our model can achieve satisfactory performance on both rating prediction and abstractive tips generation.
Table 10. Examples of predicted ratings and generated tips. In each pair of rows, the first row shows the rating and tips produced by NRT, and the second row shows the ground truth.
Rating  Tips
4.64    This is a great product for a great price.
5       Great product at a great price.

4.87    I purchased this as a replacement and it is a perfect fit and the sound is excellent.
5       Amazing sound.

4.69    I have been using these for a couple of months.
4       Plenty of wire gets signals and power to my amp just fine quality wise.

4.87    One of my favorite movies.
5       This is a movie that is not to be missed.

4.07    Why do people hate this film.
4       Universal why didnt your company release this edition in 1999.

2.25    Not as good as i expected.
5       Jack of all trades master of none.

1.46    What a waste of time and money.
1       The coen brothers are two sick bastards.

4.34    Not bad for the price.
3       Ended up altering it to get rid of ripples.
6. Conclusions
We propose a deep-learning-based framework named NRT which can simultaneously predict precise ratings and generate abstractive tips with good linguistic quality that simulate user experience and feelings. For abstractive tips generation, a GRU with context information is employed to “translate” user and item latent factors into a concise sentence. All the neural parameters as well as the latent factors for users and items are learnt by a multi-task learning approach in an end-to-end training paradigm. Experimental results on benchmark datasets show that NRT achieves better performance than the state-of-the-art models on both tasks of rating prediction and abstractive tips generation. The generated tips can vividly predict the user experience and feelings.
References
 Almahairi et al. (2015) Amjad Almahairi, Kyle Kastner, Kyunghyun Cho, and Aaron Courville. 2015. Learning distributed representations from reviews for collaborative filtering. In RecSys. ACM, 147–154.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
 Bing et al. (2015) Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive Multi-Document Summarization via Phrase Selection and Merging. In ACL. 1587–1597.
 Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. JMLR 3, Jan (2003), 993–1022.
 Chen et al. (2015) Li Chen, Guanliang Chen, and Feng Wang. 2015. Recommender systems based on user reviews: the state of the art. User Modeling and User-Adapted Interaction 25, 2 (2015), 99–154.
 Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP (2014), 1724–1734.
 Erkan and Radev (2004) Günes Erkan and Dragomir R Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. JAIR 22 (2004), 457–479.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS. 2672–2680.
 He et al. (2015) Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. Trirank: Review-aware explainable recommendation by modeling aspects. In CIKM. ACM, 1661–1670.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. 173–182.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
 Koehn (2004) Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Conference of the Association for Machine Translation in the Americas. Springer, 115–124.
 Koren (2008) Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD. ACM, 426–434.
 Koren et al. (2009) Yehuda Koren, Robert Bell, Chris Volinsky, and others. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS. 1097–1105.
 Le and Mikolov (2014) Quoc V Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML. 1188–1196.
 Lee and Seung (2001) Daniel D Lee and H Sebastian Seung. 2001. Algorithms for nonnegative matrix factorization. In NIPS. 556–562.

 Li et al. (2017) Piji Li, Zihao Wang, Wai Lam, Zhaochun Ren, and Lidong Bing. 2017. Salience Estimation via Variational Auto-Encoders for Multi-Document Summarization. In AAAI. 3497–3503.
 Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In ACL Workshop, Vol. 8.
 Ling et al. (2014) Guang Ling, Michael R Lyu, and Irwin King. 2014. Ratings meet reviews, a combined approach to recommend. In RecSys. 105–112.
 Marlin (2003) Benjamin M Marlin. 2003. Modeling user rating profiles for collaborative filtering. In NIPS. 627–634.
 McAuley and Leskovec (2013) Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In RecSys. ACM, 165–172.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111–3119.

 Ng (2011) Andrew Ng. 2011. Sparse autoencoder. CS294A Lecture notes 72 (2011), 1–19.
 Ren et al. (2017) Zhaochun Ren, Shangsong Liang, Piji Li, Shuaiqiang Wang, and Maarten de Rijke. 2017. Social collaborative viewpoint regression with explainable recommendations. In WSDM.
 Ricci et al. (2011) Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. Springer.
 Rosenblatt (1961) Frank Rosenblatt. 1961. Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical Report. DTIC Document.
 Rumelhart et al. (1988) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1988. Learning representations by back-propagating errors. Cognitive modeling 5, 3 (1988), 1.

 Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.
 Salakhutdinov and Mnih (2007) Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In NIPS. 1–8.
 Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In ICML. ACM, 791–798.
 Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW. ACM, 285–295.
 Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In WWW. ACM, 111–112.
 Shi et al. (2010) Yue Shi, Martha Larson, and Alan Hanjalic. 2010. Listwise learning to rank with matrix factorization for collaborative filtering. In RecSys. 269–272.
 Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS. 926–934.

 Su and Khoshgoftaar (2009) Xiaoyuan Su and Taghi M Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in artificial intelligence 2009 (2009), 4.
 Theano Development Team (2016) Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688 (2016).

 Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR 11 (2010), 3371–3408.
 Wang and Blei (2011) Chong Wang and David M Blei. 2011. Collaborative topic modeling for recommending scientific articles. In KDD. ACM, 448–456.
 Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In KDD. ACM, 1235–1244.
 Wang et al. (2016) Hao Wang, SHI Xingjian, and Dit-Yan Yeung. 2016. Collaborative recurrent autoencoder: Recommend while learning to fill in the blanks. In NIPS. 415–423.
 Wu et al. (2017) Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. 2017. Recurrent Recommender Networks. In WSDM. 495–503.
 Wu et al. (2016a) Yao Wu, Christopher DuBois, Alice X Zheng, and Martin Ester. 2016a. Collaborative denoising auto-encoders for top-n recommender systems. In WSDM. ACM, 153–162.
 Wu et al. (2016b) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and others. 2016b. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144 (2016).
 Xu et al. (2014) Yinqing Xu, Wai Lam, and Tianyi Lin. 2014. Collaborative filtering incorporating review text and co-clusters of hidden user communities and item groups. In CIKM. ACM, 251–260.
 Xu et al. (2015) Yinqing Xu, Bei Shi, Wentao Tian, and Wai Lam. 2015. A Unified Model for Unsupervised Opinion Spamming Detection Incorporating Text Generality. In IJCAI. 725–731.
 Zeiler (2012) Matthew D Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).
 Zheng et al. (2017) Lei Zheng, Vahid Noroozi, and Philip S Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In WSDM. ACM, 425–434.