1 Introduction
In the era of information explosion, information overload is one of the dilemmas we face. Recommender systems (RSs) are instrumental in addressing this problem, because they help users identify the information they prefer [Xue et al.2017]. Further, to better model users’ preferences, neural architectures based on deep learning have been employed [He et al.2017b, Xue et al.2017]. Recent work in this direction includes NeuMF [He et al.2017b] and DMF [Xue et al.2017]. Basically, most methods represent the user and item as hidden semantic representations and then measure these representations with cosine similarity or a Multilayer Perceptron (MLP) to predict the rating.
Despite their success, previous methods are still too simple to characterize users’ complex preferences. In movie recommendation, for example, a user usually judges the quality of a movie from multiple perspectives, such as acting quality and movie style. All of these perspectives affect the preference, which is difficult for traditional neural methods to characterize. To tackle this problem, in this paper we encode the user and item into hidden representations from multiple perspectives and then measure the hidden representations to predict the preference.
However, there still exist two challenges for the encoding process: to model hierarchically organized perspectives and to capture the correlation between user and item.
First, the perspectives are hierarchically organized, from specific elements to abstract summarization. In the movie domain, for example, there are basic aspects such as actor, director and shooting technique, on which abstract aspects such as acting quality and movie style are built. In detail, movie style is decided by the director and shooting technique, while the actor and director mostly determine the acting quality. Regarding the neural model, the output of each perspective is the representation of the user/item in that perspective. For example, the encoded representation of a user in the actor perspective represents the user’s preference for actors, while the encoded representation of an item in the movie style perspective indicates the style of this movie. The representations at the low level should support the analysis at the high level, which motivates us to employ a hierarchical deep neural architecture. Thus, it is reasonable to apply multiple sequential stages and to encode the user/item from multiple perspectives in each stage.
Second, the correlation between user and item is weak in the encoding process of current models. However, according to studies in psychology [Carlson et al.2009], a user’s preference is subjective and is slightly adjusted according to the specific item, while the subjective features of a specific item can appear slightly different to different users. Therefore, we employ the attention mechanism [Schmidhuber and rgen2015] to address the correlated effects between user and item.
Specifically, to model a user’s complex preference on an item, we propose a novel neural architecture for the top-N recommendation task. Overall, our model encodes the user and item into hidden semantic representations and then measures these representations with cosine similarity to obtain the predicted preference degree. In the encoding process, our model leverages several sequential stages to model the hierarchically organized perspectives. Each stage contains several perspectives, and in each perspective the representations of user and item adjust each other through the attention mechanism. Besides, we study two methods for constructing the attention signal, termed “SoftmaxATT” and “CorrelatedATT”.
We evaluate the effectiveness of our neural architecture for the top-N recommendation task on six datasets from five domains (i.e. Movie, Book, Music, Baby Product, Office Product). Experimental results on these datasets demonstrate that our model consistently outperforms the baselines, with improvements over the best baseline ranging from 0.19% to 27.52%, and achieves state-of-the-art performance among deep recommendation models.
In summary, our contributions are outlined as follows:

We propose a novel neural architecture for recommendation systems, which focuses on the hierarchically organized perspectives and the correlation between user and item.

To the best of our knowledge, this is the first paper to introduce the attention mechanism into neural recommendation systems.

Experimental results show the effectiveness of our proposed architecture, which outperforms other state-of-the-art methods in the top-N recommendation task.
The organization of this paper is as follows. First, problem formulation and related work are introduced. Second, our neural architecture is discussed. Third, we conduct the experiments to verify our model. Last, concluding remarks are in the final section.
2 Problem Formulation & Related Work
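Suppose there are $M$ users $U = \{u_1, \dots, u_M\}$ and $N$ items $V = \{v_1, \dots, v_N\}$. Let $R \in \mathbb{R}^{M \times N}$ denote the rating matrix, where $R_{ij}$ is the rating of user $u_i$ on item $v_j$, and we write $R_{ij} = \mathrm{unk}$ if it is unknown. There are two ways to construct the user-item interaction matrix $Y \in \mathbb{R}^{M \times N}$, which indicates whether user $u_i$ has performed an operation on item $v_j$:
$$Y_{ij} = \begin{cases} 0, & \text{if } R_{ij} = \mathrm{unk} \\ 1, & \text{otherwise} \end{cases} \tag{1}$$
$$Y_{ij} = \begin{cases} 0, & \text{if } R_{ij} = \mathrm{unk} \\ R_{ij}, & \text{otherwise} \end{cases} \tag{2}$$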
Most traditional recommendation models take Equation (1) as their input [Wu et al.2016, He et al.2017b], while some recent work sets each known entry to its rating rather than to 1, as Equation (2) shows [Xue et al.2017]. We adopt the second setting, because we assume the explicit ratings in Equation (2) reflect the preference level of a user for an item.
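The difference between the two constructions can be illustrated with a minimal Python sketch. The function name `build_interaction_matrices` and the toy ratings are hypothetical and only serve the illustration; the sketch assumes ratings are given as a dictionary keyed by (user index, item index) and that unknown entries are filled with 0.

```python
from typing import Dict, List, Tuple


def build_interaction_matrices(
    ratings: Dict[Tuple[int, int], float], num_users: int, num_items: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (implicit, explicit) interaction matrices.

    implicit follows Equation (1): 1 for a known rating, 0 otherwise.
    explicit follows Equation (2): the rating itself for a known entry, 0 otherwise.
    """
    implicit = [[0.0] * num_items for _ in range(num_users)]
    explicit = [[0.0] * num_items for _ in range(num_users)]
    for (i, j), rating in ratings.items():
        implicit[i][j] = 1.0
        explicit[i][j] = float(rating)
    return implicit, explicit


# Toy data (hypothetical): user 0 rated items 0 and 2, user 1 rated item 1.
toy_ratings = {(0, 0): 5.0, (0, 2): 3.0, (1, 1): 4.0}
Y_implicit, Y_explicit = build_interaction_matrices(toy_ratings, num_users=2, num_items=3)
```

The explicit variant keeps the rating magnitude, which is the information the second setting relies on.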
Recommendation is conventionally formulated as the problem of estimating the rating of each unobserved entry in $Y$, which is then used to rank the items. Model-based approaches, the mainstream methodology, use an underlying model to generate all the ratings:
$$\hat{Y}_{ij} = F(u_i, v_j \mid \Theta) \tag{3}$$
where $\hat{Y}_{ij}$ denotes the predicted score of the interaction between user $u_i$ and item $v_j$, $\Theta$ indicates the model parameters and $F$ denotes the recommendation model that predicts the scores. With the scores predicted by $F$, we can rank the items for an individual user to conduct personalized recommendation.
First, matrix factorization was proposed for this task as a latent semantic space methodology. The classical latent factor model [Koren, Bell, and Volinsky2009] applies the inner product of the hidden representations of user and item to predict the entry as follows
$$\hat{Y}_{ij} = F^{LFM}(u_i, v_j \mid \Theta) = p_i^{\top} q_j \tag{4}$$
where $\hat{Y}_{ij}$ means the predicted score, $F^{LFM}$ indicates the latent factor model, and $p_i$ / $q_j$ indicates the hidden representation of user $u_i$ / item $v_j$, respectively. Much related research has followed, such as [Koren2008, Mcauley2013Hidden, Bao2014TopicMF].
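As a rough illustration of how an inner-product score is turned into a personalized ranking, the following Python sketch scores every item for one user and sorts the items by score. The names `lfm_score` and `rank_items` and the toy factors are hypothetical; learning the factors is not shown.

```python
from typing import List, Sequence


def lfm_score(p_user: Sequence[float], q_item: Sequence[float]) -> float:
    """Inner product of the user and item hidden representations."""
    return sum(p * q for p, q in zip(p_user, q_item))


def rank_items(p_user: Sequence[float], item_factors: Sequence[Sequence[float]]) -> List[int]:
    """Return item indices sorted from the highest to the lowest predicted score."""
    scores = [lfm_score(p_user, q) for q in item_factors]
    return sorted(range(len(item_factors)), key=lambda j: scores[j], reverse=True)


# Toy latent factors (hypothetical values).
user = [1.0, 0.5]
items = [[0.2, 0.1], [0.9, 0.4], [0.5, 0.5]]
ranking = rank_items(user, items)
```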
Then, extra data such as social relationships were incorporated into recommendation for further improvement [Ma et al.2008]. However, because the additional data is difficult to obtain and is often noisy, this methodology remains limited.
Last, owing to the powerful representation learning ability of neural networks, deep learning methods have been successfully applied to this field. Restricted Boltzmann Machines [Salakhutdinov, Mnih, and Hinton2007] pioneered this branch. Meanwhile, autoencoders and denoising autoencoders have also been investigated for this task [Li, Kawale, and Fu2015, Sedhain2015AutoRec, Strub2015Collaborative]. The main principle of these methods is to predict a user’s ratings by learning hidden representations from historical behaviors (i.e. ratings and reviews). Recently, to learn non-linear interactions, neural collaborative filtering (NeuCF) [He et al.2017b] presented an approach in which users and items are embedded into numerical vectors and the embeddings are then processed by a multilayer perceptron to learn the users’ preference. Deep matrix factorization (DMF) [Xue et al.2017] combines the spirit of the latent factor model and of neural collaborative filtering. Specifically, DMF independently encodes the user and item with a multilayer perceptron (MLP) and then measures the resulting hidden representations of user and item in the manner of Equation (4) to predict the preference degree. DMF thus takes advantage of deep representation learning to achieve state-of-the-art performance.
We list the notations used in the following sections. $u$ indicates a user and $v$ indicates an item. $i$ and $j$ are the indices for $u$ and $v$, respectively. $Y$ denotes the user-item interaction matrix, formulated in Equation (2), while $Y^{+}$ denotes the observed interactions, $Y^{-}$ means all zero elements in $Y$ and $Y^{-}_{sampled}$ denotes the negative instances generated from sampling. Notably, means the training and developing dataset while is the source of testing dataset. Further, we indicate the $i$-th row of matrix $Y$ as $Y_{i*}$, the $j$-th column as $Y_{*j}$ and its $(i, j)$-th entry as $Y_{ij}$.
3 Methodology
In this section, we first introduce the overall sketch of our proposed neural architecture, which is illustrated in Fig.2. Then, we discuss the details of each component in a bottom-up manner, namely the interaction matrix, sequential stages and cosine similarity. The implementation of each stage and of the attention mechanism (demonstrated in Fig.3 and Fig.4) is also analyzed. Last, we present our loss function and training algorithm.
3.1 Neural Architecture
Our neural architecture is illustrated in Fig.2. Basically, our model is composed of three components, namely the interaction matrix, sequential stages and cosine similarity.
Interaction Matrix. As mentioned in the previous section, we form the interaction matrix $Y$ as in Equation (2), which is the input of our model. From the interaction matrix $Y$, each user $u_i$ is represented as a high-dimensional vector $Y_{i*}$, which holds the corresponding user’s ratings across all items, while each item $v_j$ is represented as a high-dimensional vector $Y_{*j}$, which holds the corresponding item’s ratings across all users. Notably, it is a conventional trick to fill each unknown entry with $0$. To overcome the sparsity of the interaction matrix, the inputs of user and item are transformed by a linear layer with the ReLU activation function (i.e. $f(x) = \max(0, x)$) as
$$o_u^{(0)} = f\left(W_u Y_{i*} + b_u\right) \tag{5}$$
$$o_v^{(0)} = f\left(W_v Y_{*j} + b_v\right) \tag{6}$$
where $o_u^{(0)}$ / $o_v^{(0)}$ is the output of this layer for user/item, $Y_{i*}$ / $Y_{*j}$ means the row/column of the interaction matrix used as the input for user/item, $W_u, b_u, W_v, b_v$ are the parameters of the linear layer and $f$ is the activation function (i.e. ReLU).
Sequential Stages. In order to model the hierarchically organized perspectives shown in Fig.1, we leverage multiple sequential stages, shown in Fig.2. Each stage contains several perspectives that model the user/item representations from multiple aspects. Each perspective takes the output of the previous stage as its input, and the outputs of all the perspectives in one stage are concatenated, separately for user and item, to form the output representation of that stage, as shown in Fig.2.
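Specifically, in one perspective, the inputs (the output representations of user and item from the previous stage) are first transformed by a linear layer with the ReLU activation function:
$$l_u^{(s,k)} = f\left(W_u^{(s,k)} o_u^{(s-1)} + b_u^{(s,k)}\right) \tag{7}$$
$$l_v^{(s,k)} = f\left(W_v^{(s,k)} o_v^{(s-1)} + b_v^{(s,k)}\right) \tag{8}$$
where $f$ indicates the ReLU function, $l_u^{(s,k)}$ / $l_v^{(s,k)}$ is the output for user/item of the linear layer in the $k$-th perspective of the $s$-th stage, $o_u^{(s-1)}$ / $o_v^{(s-1)}$ is the output for user/item of the previous stage and $W_u^{(s,k)}, b_u^{(s,k)}, W_v^{(s,k)}, b_v^{(s,k)}$ are model parameters.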
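Then, the attention signal is generated from the output of the linear layer by the attention mechanism:
$$a_u^{(s,k)} = \mathrm{Att}_u\left(l_u^{(s,k)}, l_v^{(s,k)}\right) \tag{9}$$
$$a_v^{(s,k)} = \mathrm{Att}_v\left(l_u^{(s,k)}, l_v^{(s,k)}\right) \tag{10}$$
where $a_u^{(s,k)}$ / $a_v^{(s,k)}$ is the attention signal for user/item in the $k$-th perspective of the $s$-th stage and $l_u^{(s,k)}$ / $l_v^{(s,k)}$ is the output for user/item of the linear layer in the $k$-th perspective of the $s$-th stage. $\mathrm{Att}_u$ / $\mathrm{Att}_v$ indicates the attention function for user/item.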
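Last, the output of this perspective is generated by weighting the output of the linear layer with the attention signal through an element-wise product. Mathematically, we have:
$$o_u^{(s,k)} = a_u^{(s,k)} \odot l_u^{(s,k)} \tag{11}$$
$$o_v^{(s,k)} = a_v^{(s,k)} \odot l_v^{(s,k)} \tag{12}$$
where $o_u^{(s,k)}$ / $o_v^{(s,k)}$ is the output of the $k$-th perspective in the $s$-th stage, $a_u^{(s,k)}$ / $a_v^{(s,k)}$ is the attention signal for user/item in the $k$-th perspective of the $s$-th stage and $l_u^{(s,k)}$ / $l_v^{(s,k)}$ is the output for user/item of the linear layer in the $k$-th perspective of the $s$-th stage. $\odot$ means the element-wise product.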
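The three steps of a perspective (linear layer with ReLU, attention signal, element-wise product) and the concatenation of perspectives within a stage can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the bias term in the linear layer, the helper names (`relu_linear`, `perspective`, `stage`) and the placeholder `all_ones_attention` are all hypothetical, and the attention function is passed in so that either attention setting could be plugged in.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Attention = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def relu_linear(W: Matrix, b: Vector, x: Sequence[float]) -> Vector:
    """Dense layer followed by ReLU: max(0, W x + b), with W stored as a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]


def perspective(o_u: Vector, o_v: Vector, params, attention: Attention) -> Tuple[Vector, Vector]:
    """One perspective: linear layer with ReLU, attention signal, element-wise product."""
    W_u, b_u, W_v, b_v = params  # the bias terms are an assumption of this sketch
    l_u = relu_linear(W_u, b_u, o_u)
    l_v = relu_linear(W_v, b_v, o_v)
    a_u, a_v = attention(l_u, l_v)
    out_u = [a * l for a, l in zip(a_u, l_u)]
    out_v = [a * l for a, l in zip(a_v, l_v)]
    return out_u, out_v


def stage(o_u: Vector, o_v: Vector, all_params, attention: Attention) -> Tuple[Vector, Vector]:
    """Feed the same inputs to every perspective and concatenate their outputs."""
    out_u: Vector = []
    out_v: Vector = []
    for params in all_params:
        p_u, p_v = perspective(o_u, o_v, params, attention)
        out_u.extend(p_u)
        out_v.extend(p_v)
    return out_u, out_v


def all_ones_attention(l_u: Vector, l_v: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical placeholder that leaves the representations unchanged."""
    return [1.0] * len(l_u), [1.0] * len(l_v)


# Two perspectives with identity weights and zero bias (toy values).
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
params = (identity, zero_bias, identity, zero_bias)
stage_u, stage_v = stage([1.0, -2.0], [0.5, 3.0], [params, params], all_ones_attention)
```

Stacking several calls to `stage`, each consuming the previous stage's output, gives the sequential structure described in the text.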
Cosine Similarity. To generate the preference of user $u_i$ on item $v_j$, we measure the output representations of user/item in the final stage with cosine similarity, which is a conventional operation in neural architectures [Wang, Mi, and Ittycheriah2016]:
$$\hat{Y}_{ij} = \cos\left(o_u^{(S)}, o_v^{(S)}\right) = \frac{o_u^{(S)\top} o_v^{(S)}}{\left\|o_u^{(S)}\right\| \left\|o_v^{(S)}\right\|} \tag{13}$$
where $\hat{Y}_{ij}$ is the predicted preference of user $u_i$ on item $v_j$, $o_u^{(S)}$ / $o_v^{(S)}$ is the output representation for user/item of the final stage $S$, and $\|\cdot\|$ is the length of a vector.
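For concreteness, cosine similarity between two final representations can be computed as in the short Python sketch below. The function name is hypothetical, and returning 0.0 for a zero-length vector is a choice made for this sketch only; the text does not say how that case is handled.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption of this sketch: undefined similarity is mapped to 0
    return dot / (norm_u * norm_v)
```

Because the score depends only on the angle between the two vectors, scaling either representation by a positive constant leaves the prediction unchanged.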
3.2 Attention Mechanism
As motivated in the Introduction, to characterize the correlations between user and item, we leverage the attention mechanism to refine the encoded representations of user/item, as Equation (9) and Equation (10) show. With the attention mechanism, the final representations for user/item are more flexible and more precise in characterizing the user’s complex preference on the item.
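First, as shown in Fig.3, we directly employ a softmax layer to construct the attention signal, which is a conventional and common form for attention-based methods [Yang et al.2017, cui2016attention, yin2015abcnn]:
$$a_u^{(s,k)} = \mathrm{Att}_u\left(l_u^{(s,k)}, l_v^{(s,k)}\right) = \mathrm{softmax}\left(A_u^{(s,k)} l_v^{(s,k)}\right) \tag{14}$$
$$a_v^{(s,k)} = \mathrm{Att}_v\left(l_u^{(s,k)}, l_v^{(s,k)}\right) = \mathrm{softmax}\left(A_v^{(s,k)} l_u^{(s,k)}\right) \tag{15}$$
where $A_u^{(s,k)}$ / $A_v^{(s,k)}$ is the attention matrix for user/item in the $k$-th perspective of the $s$-th stage, $\mathrm{softmax}$ is the softmax operation on a vector, and the other symbols were introduced in the previous subsection: $\mathrm{Att}_u$ / $\mathrm{Att}_v$ is the attention function for user/item and $l_u^{(s,k)}$ / $l_v^{(s,k)}$ is the output for user/item of the linear layer in the $k$-th perspective of the $s$-th stage.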
Notably, the attention matrices are model parameters to be learned. Specifically, the attention signal for the user is generated from the representation of the item, while the attention signal for the item is generated from the representation of the user, which accords with our motivation of correlation. We call this attention setting “SoftmaxATT”.
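A minimal Python sketch of this cross-wise softmax attention is given below. It assumes that each attention matrix maps the other side's representation to a vector with the same length as the side being attended, so that the signal can later be multiplied element-wise; this shape convention, the function names and the toy values are assumptions of the sketch.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(z: Sequence[float]) -> Vector:
    """Numerically stable softmax over a vector."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(A: Matrix, x: Sequence[float]) -> Vector:
    """Matrix-vector product with A stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def softmax_att(l_u: Vector, l_v: Vector, A_u: Matrix, A_v: Matrix) -> Tuple[Vector, Vector]:
    """User attention is computed from the item representation and vice versa."""
    a_u = softmax(matvec(A_u, l_v))  # A_u has shape len(l_u) x len(l_v) in this sketch
    a_v = softmax(matvec(A_v, l_u))  # A_v has shape len(l_v) x len(l_u) in this sketch
    return a_u, a_v


# Toy inputs (hypothetical values).
l_u, l_v = [1.0, 2.0], [0.5, 0.0, 1.5]
A_u = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
A_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
a_u, a_v = softmax_att(l_u, l_v, A_u, A_v)
```

With an all-zero attention matrix the signal is uniform, so any preference for particular dimensions has to come from the learned entries.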
However, the correlation modeled by a simple softmax operation could still be improved. For more effective correlation modeling, we propose a novel attention structure, shown in Fig.4. First, we compute the softmax vectors as the first attention method does:
$$s_u^{(s,k)} = \mathrm{softmax}\left(A_u^{(s,k)} l_v^{(s,k)}\right) \tag{16}$$
$$s_v^{(s,k)} = \mathrm{softmax}\left(A_v^{(s,k)} l_u^{(s,k)}\right) \tag{17}$$
where $s_u^{(s,k)}$ / $s_v^{(s,k)}$ is the output of the softmax layer in the $k$-th perspective of the $s$-th stage and the other symbols were introduced previously. Then, we construct the correlation matrix between the representations of user and item, as
$$C^{(s,k)} = s_u^{(s,k)} \, s_v^{(s,k)\top} \tag{18}$$
where $s_u^{(s,k)}$ / $s_v^{(s,k)}$ is the output of the softmax layer and $C^{(s,k)}$ is the correlation matrix of the $k$-th perspective in the $s$-th stage, which contains the correlated information of all the dimensions for user/item. Last, we process the correlation matrix with the activation function $f$ and average over its rows/columns to obtain the attention vector for user/item, as
$$a_u^{(s,k)} = \mathrm{RowAvg}\left(f\left(C^{(s,k)}\right)\right) \tag{19}$$
$$a_v^{(s,k)} = \mathrm{ColAvg}\left(f\left(C^{(s,k)}\right)\right) \tag{20}$$
where $\mathrm{RowAvg}$ / $\mathrm{ColAvg}$ indicates the average operation over rows/columns and the other symbols were introduced previously. With the explicit computation of the correlation matrix, the correlated effects between user and item can be characterized to a better extent. We call this attention setting “CorrelatedATT”.
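The correlation-based attention can be sketched as follows. The sketch takes the two softmax vectors as given and makes two assumptions that the text does not spell out: the correlation matrix is formed as the outer product of the user and item softmax vectors, and the activation function is ReLU. The function name and the toy inputs are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def correlated_att(s_u: Sequence[float], s_v: Sequence[float]) -> Tuple[Vector, Vector]:
    """Build a user-by-item correlation matrix and average it per row and per column."""
    # Assumption: outer product as the correlation matrix, ReLU as the activation.
    corr = [[max(0.0, a * b) for b in s_v] for a in s_u]
    # Averaging each row gives one value per user dimension.
    a_u = [sum(row) / len(row) for row in corr]
    # Averaging each column gives one value per item dimension.
    a_v = [sum(corr[r][c] for r in range(len(corr))) / len(corr) for c in range(len(s_v))]
    return a_u, a_v


# Toy softmax outputs (hypothetical values that each sum to 1).
a_u, a_v = correlated_att([0.5, 0.5], [0.25, 0.75])
```

Each entry of the matrix pairs one user dimension with one item dimension, so both attention vectors depend on both sides, which is the correlated effect the text aims for.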
3.3 Training
The definition of the objective function is critical for recommendation models. For our model, we adopt a point-wise objective function with cross-entropy loss. Although the square loss is widely used in many existing models [Hu, Koren, and Volinsky2008, mnih2008probabilistic], neural architectures usually employ cross-entropy loss [He et al.2017a, wu2017sequence]. Thus, our objective function is
$$L = -\sum_{(i,j) \in Y^{+} \cup Y^{-}_{sampled}} \left( Y_{ij} \log \hat{Y}_{ij} + \left(1 - Y_{ij}\right) \log\left(1 - \hat{Y}_{ij}\right) \right) \tag{21}$$
where $L$ is the objective function, $Y_{ij}$ is the golden rating, $\hat{Y}_{ij}$ is the predicted score and other symbols are introduced in Related Work. As in previous literature [He et al.2017a, Xue et al.2017], the target value $Y_{ij}$ is binarized to $1$ or $0$ for the rating $R_{ij}$, denoting whether user $u_i$ has interacted with item $v_j$ or not. Besides, the model is trained using Stochastic Gradient Descent (SGD) with Adam [Kingma and Ba2014], which is an adaptive learning rate algorithm.
The training process needs negative samples, and all the ratings in the training set are positive ones. Thus, for each positive sample we randomly sample several negative samples that do not appear in the training/developing/testing dataset. We use the negative sample ratio to denote how many negative samples are generated for one positive instance.
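The point-wise cross-entropy objective and the negative sampling step can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, the clipping constant `eps` is an assumption added to keep the logarithm finite, and the sampler assumes that `observed` contains every known (user, item) pair from all splits.

```python
import math
import random
from typing import List, Sequence, Set, Tuple


def binary_cross_entropy(targets: Sequence[float], predictions: Sequence[float],
                         eps: float = 1e-7) -> float:
    """Summed point-wise cross-entropy between binarized targets and predicted scores."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip scores so log() stays finite
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


def sample_negatives(user: int, observed: Set[Tuple[int, int]], num_items: int,
                     ratio: int, rng: random.Random) -> List[int]:
    """Draw up to `ratio` distinct items that the user has no known interaction with."""
    candidates = [j for j in range(num_items) if (user, j) not in observed]
    return rng.sample(candidates, min(ratio, len(candidates)))


# Toy usage (hypothetical data): user 0 interacted with items 0 and 1 out of 6 items.
observed_pairs = {(0, 0), (0, 1)}
negatives = sample_negatives(0, observed_pairs, num_items=6, ratio=3, rng=random.Random(0))
loss = binary_cross_entropy([1.0, 0.0], [0.5, 0.5])
```

With a negative sample ratio of 3, each positive instance contributes one term with target 1 and three terms with target 0 to the sum.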
Datasets        Metrics  Baselines        Our Methods                 Improvements over the Best Baseline
                         NeuMF   DMF      SoftmaxATT  CorrelatedATT   SoftmaxATT  CorrelatedATT
Movie           NDCG     0.395   0.400    0.402       0.410           0.50%       2.50%
Movie           HR       0.670   0.676    0.686       0.688           1.48%       1.78%
Movie1M         NDCG     0.440   0.445    0.447       0.448           0.45%       0.67%
Movie1M         HR       0.722   0.723    0.732       0.735           1.24%       1.66%
Book            NDCG     0.477   0.471    0.483       0.484           1.26%       1.47%
Book            HR       0.676   0.667    0.690       0.694           2.07%       2.81%
Music           NDCG     0.220   0.230    0.253       0.262           10.00%      13.90%
Music           HR       0.371   0.382    0.428       0.445           12.04%      16.49%
Baby            NDCG     0.160   0.162    0.172       0.182           6.17%       12.34%
Baby            HR       0.285   0.287    0.321       0.366           11.85%      27.52%
Office Product  NDCG     0.233   0.243    0.261       0.262           7.40%       7.81%
Office Product  HR       0.518   0.520    0.521       0.532           0.19%       2.30%
Table 1: NDCG@10 and HR@10 comparisons of different methods. We conduct a t-test for statistical significance, and all of the improvements are statistically significant.
4 Experiment
In this section, first, we will introduce the basic experimental settings, namely datasets, evaluation and implementation. Then, we will conduct the experiments about model performance. Last, we will analyze the sensitivity to hyperparameters for our model.
4.1 Experimental Setting
Datasets. We evaluate our models on six widely used datasets from five domains in recommender systems: MovieLens 100K (Movie), MovieLens 1M (Movie1M), Amazon music (Music), Amazon Kindle books (Book), Amazon office product (Office) and Amazon baby product (Baby). The MovieLens datasets are available at https://grouplens.org/datasets/movielens/ and the Amazon datasets at http://jmcauley.ucsd.edu/data/amazon/. We process the datasets according to the previous literature [Wu et al.2016, Xue et al.2017, He et al.2017b]. We do not process Movie and Movie1M, because they are already filtered. The other datasets are filtered to be similar to the MovieLens data: only users with at least 20 interactions and items with at least 5 interactions are retained. We will publish our filtered datasets once accepted. We list the statistics of all six processed datasets in Tab.2.
Statistics  #Users  #Items  #Ratings   Density
Movie       994     1,683   100,000    6.294%
Movie1M     6,040   3,706   1,000,209  4.468%
Music       1,776   12,929  46,087     0.201%
Book        14,803  96,538  627,441    0.004%
Office      941     6,679   27,254     4.336%
Baby        1,100   8,539   30,166     0.321%
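The filtering rule applied to the Amazon datasets (users with at least 20 interactions, items with at least 5 interactions) can be sketched in Python as follows. The text does not say whether the rule is applied once or repeated until no further rows are dropped; this sketch assumes a single pass with counts taken on the raw data, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Interaction = Tuple[Hashable, Hashable]


def filter_interactions(interactions: Sequence[Interaction],
                        min_user: int = 20, min_item: int = 5) -> List[Interaction]:
    """Keep interactions whose user and item both reach the minimum interaction counts."""
    user_counts = Counter(user for user, _ in interactions)
    item_counts = Counter(item for _, item in interactions)
    return [
        (user, item)
        for user, item in interactions
        if user_counts[user] >= min_user and item_counts[item] >= min_item
    ]


# Toy usage with small thresholds so the effect is visible.
toy = [("a", 1), ("a", 2), ("b", 1)]
kept = filter_interactions(toy, min_user=2, min_item=2)
```

Density in a statistics table of this kind is the number of ratings divided by the product of the number of users and the number of items.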
Evaluation. To verify the performance of our model for item recommendation, we adopt the leave-one-out evaluation, which has been widely used in the related literature [He et al.2017b, Xue et al.2017]. We hold out the latest interaction of each user as the test item and use the remaining data for training. Since it is too time-consuming to rank all the items for every user during testing, following [Koren, Bell, and Volinsky2009, He et al.2017b, Xue et al.2017], we randomly sample 100 items that the corresponding user has not interacted with as the test set for this user. We then rank the test item among these 100 items according to the prediction scores. We use Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate the ranking performance [Xue et al.2017, He et al.2017a]. By default, we truncate the ranked list at 10 for both metrics, so HR/NDCG means HR@10/NDCG@10, as in previous literature [Xue et al.2017]. HR@K/NDCG@K is defined analogously for other cutoffs K.
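For a single user under leave-one-out evaluation, the two metrics reduce to simple functions of the position of the held-out item among the sampled candidates. The Python sketch below uses the common convention for this protocol (HR@K is 1 if the test item is within the top K, NDCG@K is 1 / log2(rank + 2) for a 0-based rank within the top K); the function names and the tie-breaking rule are assumptions of this sketch.

```python
import math
from typing import Sequence


def rank_of_test_item(test_score: float, negative_scores: Sequence[float]) -> int:
    """0-based rank of the test item: how many sampled items score strictly higher."""
    return sum(1 for s in negative_scores if s > test_score)


def hit_ratio_at_k(rank: int, k: int = 10) -> float:
    """1.0 if the test item appears in the top-k list, else 0.0."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank: int, k: int = 10) -> float:
    """With a single relevant item the ideal DCG is 1, so NDCG is the discounted gain."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


# Toy usage: two of the three sampled items outscore the test item, so its rank is 2.
rank = rank_of_test_item(0.8, [0.9, 0.1, 0.85])
```

The reported HR and NDCG are then the averages of these per-user values over all test users.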
Detailed Implementation. We implement our proposed methods based on TensorFlow (https://www.tensorflow.org) and the released code of DMF [Xue et al.2017]. Our code will be released publicly upon acceptance. To determine the hyperparameters of our model, we randomly sample one interaction for each user as the development data and tune the hyperparameters on it. For the neural part of our model, we randomly initialize the model parameters with a Gaussian distribution. We tune the batch size, the number of negative instances per positive instance, the learning rate, the number of stages, the number of perspectives in each stage, the dimension of all the linear layers, the output dimension of the non-final stages and the output dimension of the final stage, and we report results under the optimal setting found.
4.2 Performance Verification
Baselines. As our proposed methods aim to model the relationship between users and items, we follow [Xue et al.2017] and [He et al.2017b] and mainly compare with user-item models. Thus, we leave out the comparison with item-item models, such as CDAE [Wu et al.2016]. Since neural recommendation has only recently come into focus, we compare with two suitable recent baseline models.
NeuMF. This is a neural matrix factorization method for item recommendation. This method embeds the user and item as hidden representations and then leverages a multilayer perceptron to learn the user-item interaction function based on the embeddings of user and item. We implement the pre-training version of NeuMF and tune its hyperparameters in the same way as [He et al.2017b].
DMF. This is the state-of-the-art neural recommendation method. This method encodes the user and item into hidden representations independently and measures the representations of user and item to predict the user’s preference degree for the item. We implement DMF and tune its hyperparameters in the same way as [Xue et al.2017].
Conclusions. The comparisons are reported in Tab.1, from which we draw the following conclusions:

Our methods outperform both baselines on all six datasets for both metrics, which justifies the effectiveness of our model.

“CorrelatedATT” performs better than “SoftmaxATT”, which suggests that characterizing the correlations between user and item improves the model performance.

There exist some domains where the improvement is clearly larger than in the others. We suppose these domains have clearer hierarchical perspectives. In the Music domain, for example, there are many low-level aspects such as singer, writer, composer, volume and speed, on which high-level aspects such as genre, style and melody are constructed and analyzed.
4.3 Sensitivity to Hyperparameters
In this subsection, to verify the effect of hyperparameters, we use the “CorrelatedATT” attention setting and the optimal experimental setting introduced in Detailed Implementation by default.
HR@K & NDCG@K. Fig.6 shows the performance of top-K recommended lists as the ranking position K varies. Our method demonstrates consistent improvements over the other methods across different K. On the Movie dataset, our model outperforms DMF by 0.0239 for HR@K and 0.010 for NDCG@K on average, while on the Music dataset, our method improves over DMF by 0.0360 for HR@K and 0.0261 for NDCG@K on average. This comparison demonstrates the consistent effectiveness of our methods.
Effect of Number of Negative Samples. As described in the previous section, our method samples negative instances from unobserved data for training. In this experiment, we test different negative sampling ratios to observe how the performance varies (e.g. neg5 indicates that the negative sampling ratio is 5, i.e. we sample 5 negative instances per positive instance). From the results in Tab.3, we find that a larger negative sample ratio can lead to better performance, while an overly large ratio seems to harm the results. For example, NDCG on the Movie dataset increases up to neg5 and declines from neg9 onward. In detail, the optimal negative sample ratio is around 5, which is consistent with previous research [He et al.2017a, Xue et al.2017].
Datasets  Metric  Negative Sample Ratio
                  1      2      5      9      10
Movie     NDCG    0.342  0.351  0.368  0.367  0.364
Movie     HR      0.615  0.633  0.642  0.646  0.645
Music     NDCG    0.202  0.205  0.217  0.224  0.216
Music     HR      0.341  0.345  0.360  0.372  0.359
Baby      NDCG    0.168  0.169  0.170  0.173  0.182
Baby      HR      0.312  0.321  0.322  0.319  0.335
Office    NDCG    0.235  0.235  0.254  0.240  0.244
Office    HR      0.506  0.507  0.507  0.513  0.512
Effect of Number of Layers. Since we model hierarchically organized perspectives, the depth (the number of layers) could be a critical factor in our method. Thus, we conduct experiments to test the effect of depth. As shown in Fig.7, the 3-layer architectures work best among all the tested models. Specifically, on the Movie dataset, the best performance with 3 layers exceeds that with 2 layers by 0.021 for HR and 0.019 for NDCG, while on the Music dataset, the best performance with 3 layers exceeds that with 2 layers by 0.072 for HR and 0.014 for NDCG. Thus, we conjecture that deeper models can extract more abstract perspectives, which helps to boost the performance.
Effect of Final Latent Dimension. Besides the negative sample ratio and the number of layers, the final latent dimension is also a sensitive factor, since it directly determines the representations from which the predicted preference is computed. We vary the final latent dimension from 8 to 128 in the experiments. As Tab.4 shows, a larger final dimension generally leads to better performance. On the Movie dataset, for example, HR rises from 0.656 at dimension 8 to 0.688 at dimension 128. Thus, we suppose that a larger latent dimension can encode more information into the final representations, which leads to better prediction accuracy.
Datasets  Metric  Final Latent Dimension
                  8      16     32     64     128
Movie     NDCG    0.392  0.395  0.400  0.390  0.410
Movie     HR      0.656  0.667  0.663  0.687  0.688
Music     NDCG    0.241  0.246  0.248  0.250  0.262
Music     HR      0.383  0.392  0.430  0.433  0.445
Office    NDCG    0.263  0.248  0.273  0.276  0.262
Office    HR      0.526  0.514  0.525  0.523  0.532
Book      NDCG    0.476  0.480  0.480  0.488  0.484
Book      HR      0.689  0.690  0.691  0.692  0.694
Training Loss and Performance. Fig.5 shows the training loss (averaged over all the training instances) and the recommendation performance of our method and the state-of-the-art baselines at each iteration on the Movie dataset. Results on the other datasets show the same trend and are omitted due to limited space. From the results, we draw two observations. First, with more iterations, the training loss of our method gradually decreases and the recommendation performance improves. The most effective updates occur in the first 10 iterations, and more iterations increase the risk of overfitting, which accords with common knowledge. Second, our method achieves a lower training loss than DMF, which illustrates that our model fits the data better. Thus, a better performance than DMF is expected. Overall, the experiments show the effectiveness of our method.
5 Conclusion
In this paper, we propose a novel neural architecture for recommendation systems. Our model encodes the user and item from multiple hierarchically organized perspectives with the attention mechanism and then measures the abstract representations to predict the user’s preference on the item. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed methods. We will publish our poster, slides, datasets and codes at https://www.github.com/....
References
 [Carlson et al.2009] Carlson, N. R.; Heth, D.; Miller, H.; Donahoe, J.; and Martin, G. N. 2009. Psychology: the science of behavior. Pearson.
 [He et al.2017a] He, S.; Liu, C.; Liu, K.; and Zhao, J. 2017a. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 199–208.
 [He et al.2017b] He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T. 2017b. Neural collaborative filtering. 25th international world wide web conference 173–182.
 [Hu, Koren, and Volinsky2008] Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, 263–272. Ieee.
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Koren, Bell, and Volinsky2009] Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8):30–37.
 [Koren2008] Koren, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 426–434.
 [Li, Kawale, and Fu2015] Li, S.; Kawale, J.; and Fu, Y. 2015. Deep collaborative filtering via marginalized denoising autoencoder. In ACM International on Conference on Information and Knowledge Management, 811–820.
 [Ma et al.2008] Ma, H.; Yang, H.; Lyu, M. R.; and King, I. 2008. Sorec: social recommendation using probabilistic matrix factorization. In ACM Conference on Information and Knowledge Management, 931–940.
 [Salakhutdinov, Mnih, and Hinton2007] Salakhutdinov, R.; Mnih, A.; and Hinton, G. 2007. Restricted boltzmann machines for collaborative filtering. In International Conference on Machine Learning, 791–798.
 [Schmidhuber and rgen2015] Schmidhuber, J. 2015. Deep learning in neural networks. Elsevier Science Ltd.
 [Wang, Mi, and Ittycheriah2016] Wang, Z.; Mi, H.; and Ittycheriah, A. 2016. Semisupervised clustering for short text via deep representation learning. In the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL).
 [Wu et al.2016] Wu, Y.; Dubois, C.; Zheng, A. X.; and Ester, M. 2016. Collaborative denoising autoencoders for top-N recommender systems. In ACM International Conference on Web Search and Data Mining, 153–162.
 [Xue et al.2017] Xue, H. J.; Dai, X. Y.; Zhang, J.; Huang, S.; and Chen, J. 2017. Deep matrix factorization models for recommender systems. In International Joint Conference on Artificial Intelligence, 3203–3209.
 [Yang et al.2017] Yang, Z.; Hu, J.; Salakhutdinov, R.; and Cohen, W. W. 2017. Semi-supervised QA with generative domain-adaptive nets. arXiv preprint arXiv:1702.02206.