I Introduction
Nowadays, recommender systems [18, 15, 2, 19] are so ubiquitous in information systems that their absence, rather than their presence, draws attention. Making personalized predictions for specific users, based on some functional dependency on the past interactions of all users and items, is known as the collaborative filtering [24, 23] approach. Within this approach, different matrix factorization (MF) techniques have proven to be quite accurate and scalable for many recommender scenarios [9, 10, 12]. Essentially, they map both users and items to a joint latent factor space of lower dimension and model each interaction as an inner product in that latent space.
Recently, various deep learning models [1] such as restricted Boltzmann machines [14], stacked autoencoders [28, 11], deep neural networks [6] or deep matrix factorizations [29] were introduced to collaborative filtering. Two aspects of collaborative filtering were upgraded: (i) linear latent representations of users and items were replaced by deep representations, and (ii) the inner product was replaced by a nonlinear function represented by a deep neural network. This has motivated us to study which of these two aspects, (i) nonlinear latent representations or (ii) nonlinear interaction functions, dominates the ability to learn incomplete ratings. In this paper, we focus on matrix factorization recommender models. Note that the representations of classical matrix factorizations such as nonnegative matrix factorization (NMF) or singular value decomposition (SVD) are essentially the same as the ones learned by basic linear autoencoders [1]. We focus on the collaborative filtering task of learning explicit ratings; for recent advances in using deep models with auxiliary information and implicit feedback, see [8]. The main contributions of the paper are the following: (i) We propose a simple model of compositions of nonlinear matrix factors for learning incomplete explicit ratings. (ii) We evaluate our approach against a variety of baselines, including both linear and nonlinear methods. (iii) In the supervised rate prediction task, our simple linear combination of nonlinear representations has lower prediction errors (RMSE) on holdout datasets, thereby showing better generalization ability. (iv) In the unsupervised clustering task performed on the obtained representations, we demonstrate that our approach attains representation ability comparable to complex deep matrix factorization, as measured by a clustering performance metric (within-cluster sum of squares).
II Preliminaries
In this section, we formulate the problem setting. The user-item rating matrix of n users and m items is denoted by R. The original representation of user u is the uth row of matrix R, i.e. R_u, while the original representation of item i is the ith column of matrix R, i.e. R_i. Let U denote the latent feature matrix of users, where u_u denotes the latent feature vector of user u, i.e. the uth row of matrix U. Similarly, let V denote the latent feature matrix of items, where v_i denotes the latent feature vector of item i, i.e. the ith column of V. The collaborative filtering task is to estimate the unobserved rating for user u and item i as

r̂_ui = f(u_u, v_i | Θ),    (1)

where Θ denotes the model parameters of the interaction function f.
The latent features and parameters are found by minimizing the objective function

min_{U,V,Θ} Σ_{(u,i)∈K} ℓ(r_ui, r̂_ui) + λ Ω(U, V, Θ),    (2)

where ℓ is a pointwise loss function (for pairwise learning see [3]), Ω is a regularizer (L2 or L1 norm [13]) and K denotes the set of training instances over which learning is done (see [22] about the missing-data assumptions). The regularized matrix factorization [9, 12] approach models the interaction function as the linear combination f(u_u, v_i) = u_u^T v_i, which represents the inner product in the latent space. The latent factors are linear representations, due to the absence of nonlinearity in the linear-algebra transformation R ≈ U V. Here, the L2 function is usually used for both the loss and the regularizer, while the set of observed ratings is used as the training set K.
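For illustration, the regularized-MF objective with squared loss over the observed ratings can be sketched in NumPy as follows (the function and variable names are ours, for illustration only):

```python
import numpy as np

def mf_objective(R_obs, U, V, lam=0.1):
    """Regularized MF loss over observed ratings.

    R_obs: list of (u, i, r) triples; U: n x k user factors (rows);
    V: k x m item factors (columns); lam: L2 regularization strength.
    """
    loss = sum((r - U[u] @ V[:, i]) ** 2 for u, i, r in R_obs)
    reg = lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return loss + reg
```

Only the observed entries enter the loss term; the L2 penalty acts on all factor entries.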
Neural collaborative filtering [6] models the interaction function as a deep neural network f(u_u, v_i) = φ_out(φ_X(… φ_1(u_u, v_i) …)), where φ_out and φ_x are the mappings of the output layer and the xth hidden layer, of the form φ_x(z) = g(W_x z + b_x). Here g represents a nonlinear function such as the rectified linear unit, the hyperbolic tangent or others. The model parameters and latent factors are learned jointly, where the latent factors are nonlinear representations of the user and item vectors from the rating matrix R. Deep matrix factorizations [29] model the interaction function as an inner product f(p_u, q_i) = p_u^T q_i. However, both user and item features are deep representations of the form p_u = g(W_2 g(W_1 R_u)) and q_i = g(V_2 g(V_1 R_i)).
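The deep user and item representations used in such models can be illustrated by a small sketch of stacked nonlinear maps (hypothetical names; the weight matrices would normally be learned, not given):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def deep_rep(x, weights, g=relu):
    """Map a raw rating vector x through stacked layers g(W_n ... g(W_1 x))."""
    for W in weights:
        x = g(W @ x)
    return x
```

Each layer applies a linear map followed by the element-wise nonlinearity g, so the final latent vector is a nonlinear function of the raw rating row or column.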
III Proposed Model
In order to learn incomplete ratings, we formulate the following multilayer semi-NMF model (NSNMF):

R ≈ B + U V,

where R is the rating matrix, B is the bias matrix, U is the latent user preferences matrix and V is the matrix of the nonlinear item latent features. To model the nonlinear representation of the items, we use the model V = g(W H), which finally leads to:

R ≈ B + U g(W H),    (3)

where H is the hidden latent representation of items, W is the weighting matrix between latent representations on different levels and g is the element-wise nonlinear function used to better approximate the nonlinear manifolds of the latent features.
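A minimal NumPy sketch of this reconstruction (our names; in practice the factors are learned from the observed ratings, not given):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def nsnmf_reconstruct(B, U, W, H, g=relu):
    """Reconstruct the rating matrix as R_hat = B + U g(W H)  (eq. 3).

    B: (n, m) bias matrix; U: (n, k) user factors (mixed sign);
    W: (k, l) weighting matrix; H: (l, m) nonnegative item hidden features.
    """
    return B + U @ g(W @ H)
```

Note that U may contain negative entries, while the nonnegativity constraint is placed on the item-side factors.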
A similar architecture to our proposed method, without the bias term, is used in the task of learning deep representations of images [27], where the learning depends on dense multiplicative updates.
Simply, we model the interaction function as an inner product with offset, r̂_ui = b_ui + u_u^T v_i, where the user feature u_u is the uth row of matrix U and the item nonlinear representation v_i is the ith column of V = g(W H). This model falls into the category of semi-NMF factorization models [4]. Semi-NMF is a variant of NMF which imposes nonnegativity constraints only on the latent factors of the second layer. This allows both positive and negative offsets from the bias term b_ui. Usually, in clustering, U represents the cluster centroids and H represents a soft membership indicator [4]. But, from the recommender perspective, matrix U may be interpreted as the linear regression coefficients, and the nonnegativity constraints imposed on the latent item attributes H allow part-based representations [20, 21]. Note that much of the observed variation in rating values is due to effects associated with either users or items, known as biases or intercepts, independent of any interactions. Thus, instead of explaining the full rating value by an interaction of the form u_u^T v_i, the system can try to identify the portion of these values that individual user or item biases cannot explain. The bias involved in a rating can be computed as b_ui = μ + b_u + b_i, where μ is the overall average rating, b_u is the observed deviation of user u and b_i is the observed deviation of item i.
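As a hedged sketch, these bias terms can be estimated from the observed (user, item, rating) triples by simple averaging (our names; in practice they can also be learned jointly by gradient descent):

```python
import numpy as np
from collections import defaultdict

def compute_biases(ratings):
    """Estimate the global mean mu, user deviations b_u and item deviations b_i
    from (user, item, rating) triples via plain averages."""
    mu = np.mean([r for _, _, r in ratings])
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append(r - mu)
    b_u = {u: float(np.mean(d)) for u, d in by_user.items()}
    for u, i, r in ratings:
        by_item[i].append(r - mu - b_u[u])
    b_i = {i: float(np.mean(d)) for i, d in by_item.items()}
    return float(mu), b_u, b_i
```

The bias of a rating is then b_ui = mu + b_u[u] + b_i[i].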
Note that, in the general case, we are able to compose more nonlinearities with the relation V = g(W_1 g(W_2 H)), which generates the model R ≈ B + U g(W_1 g(W_2 H)). However, more complex item latent representations were shown not to be useful; see the experiments section for more details.
III-A Learning Model Parameters
For a two-layered item feature structure, the model parameters are updated through an element-wise gradient descent approach, minimizing eq. (2) with the squared loss function and L2 regularization.
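Under our reading of the model, one per-rating element-wise update step can be sketched in NumPy as follows (the function names, update ordering and projection-based handling of the nonnegativity constraint are our assumptions, not the paper's exact algorithm):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    return (x > 0).astype(float)

def sgd_step(r, u, W, h, b_u, b_i, mu, eta=0.01, lam=0.1,
             g=relu, g_grad=relu_grad):
    """One gradient step on a single observed rating r for one (user, item) pair.

    u: user factor row (k,); W: (k, l) weighting matrix; h: item hidden
    factors (l,), kept nonnegative; b_u, b_i: biases; mu: global mean rating.
    Returns updated copies of the parameters.
    """
    z = W @ h
    v = g(z)                              # nonlinear item representation
    e = r - (mu + b_u + b_i + u @ v)      # error term e_ui

    new_b_u = b_u + eta * (e - lam * b_u)
    new_b_i = b_i + eta * (e - lam * b_i)
    new_u = u + eta * (e * v - lam * u)
    back = e * (u * g_grad(z))            # gradient through the nonlinearity
    new_W = W + eta * (np.outer(back, h) - lam * W)
    new_h = np.maximum(h + eta * (W.T @ back - lam * h), 0.0)  # project to >= 0
    return new_u, new_W, new_h, new_b_u, new_b_i
```

With a small learning rate, one such step moves the prediction toward the observed rating while keeping the item hidden factors nonnegative.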
The model parameters are randomly initialized uniformly in the range [0,1], and we perform iterative updates for each observed rating r_ui as follows:

b_u ← b_u + η (e_ui − λ b_u),    (4)

b_i ← b_i + η (e_ui − λ b_i),    (5)

u_u ← u_u + η (e_ui g(W h_i) − λ u_u),    (6)

W ← W + η (e_ui (u_u ∘ g′(W h_i)) h_i^T − λ W)    (7)
(update only if the updated entries of W remain nonnegative),

h_i ← h_i + η (e_ui W^T (u_u ∘ g′(W h_i)) − λ h_i)    (8)
(update only if the updated entries of h_i remain nonnegative),

where g′(·) is the derivative of the activation function, ∘ denotes the element-wise product and e_ui = r_ui − r̂_ui is the error term. Note that we do not explicitly store the dense matrix g(W H). The computational complexity for training a 2-layer item-feature NSNMF architecture is of order O(|K| · k · l · t), where k, l are the dimensions of the layers, |K| is the number of observed ratings and t is the number of iterations. The learning rate η was configured with the AdaGrad method [5], performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data, as in our case of incomplete ratings. Given that k, l are constant and |K| grows with the number of users and items, the scalability of the proposed method depends linearly on the number of users and items.

IV Experimental evaluation
In this section, we report experimental results by evaluating our NSNMF approach and a variety of baselines in both supervised and unsupervised tasks.
IV-A Datasets
We use three real datasets as follows:

- MovieLens100K (https://grouplens.org/datasets/movielens/100k/, generated on October 17, 2016): 100004 ratings across 9125 movies from 671 users (users with at least 20 ratings).

- FilmTrust (https://www.librec.net/datasets.html): 28496 ratings across 1981 movies from 654 users (filtered to have users with at least 20 ratings).

- Amazon Music (http://jmcauley.ucsd.edu/data/amazon/): 50395 ratings across 1188 items from 19260 users (filtered to have users with at least 20 ratings and items with at least 2 interactions).
Each dataset is split into training and testing sets. The training set is then used for 10-fold cross-validation for hyperparameter tuning.
IV-B Baselines and setup
We use baselines including both linear and nonlinear approaches as follows:
CF Neighborhood models are the most common approach to CF, with user-oriented and item-oriented methods [15, 16]. They are referred to as User-User CF and Item-Item CF, respectively.
SVD is applied in the collaborative filtering domain by factorizing the user-item rating matrix [17], updating only on the known ratings.
NMF Nonnegative matrix factorization (NMF) introduces the nonnegativity constraint into the MF process [12, 30]; we also include the regularized NMF variant to avoid overfitting.
RBM Restricted Boltzmann Machine (RBM) [14] is an undirected graphical model, which contains a layer of visible softmax units for the items rated by a user and binary hidden units. Each hidden unit can then learn to model a significant dependency between the ratings of different movies.
DMF presents a deep structure learning architecture to learn deep low-dimensional representations for users and items, respectively [29]. It uses both explicit ratings and implicit feedback to optimize a normalized cross-entropy loss function, predicting scaled ratings on the continuous scale [0,1].
As for our proposed NSNMF method, we evaluate it with different activation functions as follows: (i) NSNMF ReLU is the NSNMF model with the rectified linear unit g(x) = max(0, x) as the activation function, (ii) NSNMF SoftPlus uses the softplus g(x) = ln(1 + e^x) as the activation function and (iii) NSNMF ReLU_bias is the proposed model with the rectified linear unit activation function plus the bias term. In the supervised task, since we focus on explicit ratings, the root mean square error is used to assess the rate prediction performance [7]: RMSE = sqrt( (1/|T|) Σ_{(u,i)∈T} (r_ui − r̂_ui)² ), where T denotes the set of test ratings. The lower the RMSE value, the better the approach performs.
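As a concrete check, RMSE over held-out (user, item, rating) triples can be computed as follows (a minimal sketch with hypothetical names):

```python
import numpy as np

def rmse(ratings, predict):
    """Root mean square error over (user, item, rating) test triples.

    predict(u, i) returns the model's estimated rating r_hat_ui.
    """
    errs = [(r - predict(u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errs)))
```

Any of the compared models plugs in through the predict callable.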
In the unsupervised task, we aim to inspect the difference in the representations obtained by our NSNMF and the baseline approaches. We choose an unsupervised clustering task on such representations and thus use the pooled within-cluster sum of squares around the cluster means (WCSS) [25]: WCSS = Σ_k (1/(2 n_k)) Σ_{x_i, x_j ∈ C_k} d(x_i, x_j), where n_k denotes the number of elements inside cluster C_k and d(x_i, x_j) is the squared Euclidean distance between instances x_i and x_j within the same cluster.
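A minimal NumPy sketch of this pooled WCSS (our names; with squared Euclidean distances, the pairwise sum divided by 2 n_k equals the sum of squared distances to the cluster mean):

```python
import numpy as np

def wcss(X, labels):
    """Pooled within-cluster sum of squares.

    X: (n, d) data matrix; labels: (n,) cluster assignments.
    For each cluster, sums all squared pairwise distances and divides by 2 n_k.
    """
    total = 0.0
    for k in np.unique(labels):
        C = X[labels == k]
        n_k = len(C)
        # full matrix of squared pairwise distances within cluster k
        d2 = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        total += d2.sum() / (2 * n_k)
    return float(total)
```

Lower WCSS at a fixed number of clusters indicates tighter clusters in the latent item space.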
IV-C Supervised tasks
In the supervised task, we perform 10-fold cross-validation for each dataset to determine the dimensions of the hidden representations and the regularization parameter for each approach. The dimensions of the hidden representations were determined from the cross-validation results for values in {4, 6, 8, 10, 15, 20}. The learning rate and regularizer value were varied in the range {0.1, 0.01, 0.001}.
The final models were trained with learning rate 0.01, regularization parameter 0.1 and factor dimensions 4, 6 and 8 for the FilmTrust, AMusic and MovieLens datasets, respectively.
Then, we report the rate prediction RMSE errors in Table I.
In Table I, we observe that the NSNMF-based approaches, i.e. NSNMF ReLU, Softplus and ReLU_bias, outperformed the baselines across all datasets. In particular, NSNMF with ReLU_bias performed best, achieving the lowest RMSE. Meanwhile, DMF, which learns nonlinear representations, has lower RMSE than the baselines based on linear transformations, i.e. User-User CF, Item-Item CF, SVD, NMF and regularized NMF, most of the time. We trained DMF [29] (https://github.com/RuidongZ/Deep_Matrix_Factorizatio_Models) using the normalized cross-entropy loss on both implicit and explicit ratings. The ratings predicted on the scale [0,1], when scaled back to the original scale [0, max(R)], where max(R) denotes the maximum score over all ratings, perform worse than the baselines when compared to real ratings on the same scale with the RMSE measure. Thus, we use their DMF architecture trained with the squared loss function to predict unscaled ratings, which are then evaluated with RMSE.

Algorithm  FilmTrust  ML100K  AMusic 
UserUser CF  0.963  1.005  1.011 
ItemItem CF  0.822  1.001  0.934 
SVD  1.006  1.018  2.024 
NMF  0.845  0.954  1.001 
Regularized NMF  0.840  0.937  0.975 
RBM  0.918  1.008  1.104 
DMF  0.821  0.948  0.946 
NSNMF ReLU  0.816  0.904  0.889 
NSNMF Softplus  0.804  0.896  0.871 
NSNMF ReLU_bias  0.788  0.887  0.836 
Furthermore, we trained our model with different numbers of hidden layers to assess the prediction errors on all three datasets. We found that the 2-layer architecture better models the variation in the rating matrix, while adding deeper layers even decreases performance. Due to the page limitation, we report the results up to 3 layers in Table II.
Algorithm  FilmTrust  ML100K  AMusic 
ReLU 2-layer  0.816  0.904  0.889 
ReLU 3-layer  0.842  0.938  0.932 
IV-D Unsupervised task
In this part, we perform unsupervised K-means clustering to evaluate the item representations learned by the different approaches in their latent spaces. We ran each approach with the hyperparameters set via cross-validation and then obtained the derived representations.
In Figure 1, we report the WCSS [25] of each approach w.r.t. the number of clusters. We observe that our NSNMF ReLU and DMF consistently yield lower WCSS than NMF. This suggests that the representations derived by nonlinear matrix factorization demonstrate higher representation ability. The WCSS values of NSNMF ReLU and DMF are quite comparable, indicating that the nonlinear transformation is the dominant part, while the way such representations are combined results in only a minor difference in the derived representations. Moreover, the simple linear combination of nonlinear representations leads to better generalization ability in supervised prediction, as already demonstrated in Table I.
V Discussion and Future Work
In this paper, we focus on learning nonlinear item representations for explicit feedback and leave the extension to learning nonlinear implicit-feedback representations for future work. Most deep learning architectures have been implemented using the dense implicit-feedback rating matrix, whereas we implement the proposed architecture for explicit feedback only. We believe it will be interesting to see the performance of the proposed algorithm on implicit feedback, which would provide a better comparison with the deep learning algorithms [6, 28] that train only on implicit feedback. Thus, the current paper compares against the deep learning methods that use explicit feedback in their training algorithms [14, 29].
We find that simple linear regression over nonlinear item representations is sufficient to outperform other deep learning methods that use explicit feedback in their training algorithms [14, 29]. It is important to stress that in our model the linear regression and the nonlinear item representations are learned jointly via nonlinear semi-nonnegative matrix factorization. The nonnegativity constraint allows better interpretability of the item features, e.g. a movie cannot have a negative number of certain actors or a negative indication of a certain genre. However, the semi-nonnegativity constraint allows the regression coefficients to become negative, e.g. a negative relation to certain item features.
Furthermore, the linear interaction of nonlinear item features provides better predictions than the combination of nonlinear item and nonlinear user features, as in the case of the Deep Matrix Factorization model [29].
VI Conclusions
We introduced a multilayer nonlinear semi-nonnegative matrix factorization method to learn from the incomplete rating matrix. The multilayer approach, which automatically learns a hierarchy of item attributes, as well as the nonnegativity constraint, helps in the better interpretation of these factors. Furthermore, we presented an algorithm for optimizing the factors of our architecture with different nonlinearities. We evaluated our approach in comparison to a variety of matrix factorization and deep learning baselines using both supervised rate prediction and unsupervised clustering in the latent item space. The results offer the following insights: (i) the simple linear combination of nonlinear representations realized in our proposed approach achieves better generalization ability, that is, lower prediction errors on holdout datasets; (ii) in the unsupervised clustering task, the representations learned by our approach yield a clustering performance metric (within-cluster sum of squares) comparable to deep matrix factorization.
Acknowledgement
N.A.F. and T.G. are grateful for financial support from the EU Horizon 2020 project SoBigData under grant agreement No. 654024. The authors acknowledge D. Tolic for useful directions regarding the Deep Semi-NMF approach in the early stage of this work.
References
 [1] Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013): 1798-1828.
 [2] Bobadilla, Jesús, et al. "Recommender systems survey." Knowledge-Based Systems 46 (2013): 109-132.
 [3] Cao, Zhe, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. "Learning to rank." Proceedings of the 24th International Conference on Machine Learning - ICML '07. ACM Press, 2007.
 [4] Ding, Chris HQ, Tao Li, and Michael I. Jordan. "Convex and semi-nonnegative matrix factorizations." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.1 (2010): 45-55.
 [5] Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.Jul (2011): 2121-2159.
 [6] He, Xiangnan, et al. "Neural collaborative filtering." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
 [7] Herlocker, Jonathan L., et al. "Evaluating collaborative filtering recommender systems." ACM Transactions on Information Systems (TOIS) 22.1 (2004): 5-53.
 [8] Karatzoglou, Alexandros, and Balázs Hidasi. "Deep learning for recommender systems." Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 2017.
 [9] Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 42.8 (2009): 30-37.
 [10] Koren, Y., and R. Bell. "Advances in collaborative filtering." Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds. Springer, New York, NY, USA, 2011. 145-186.
 [11] Li, Sheng, Jaya Kawale, and Yun Fu. "Deep collaborative filtering via marginalized denoising auto-encoder." Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015.
 [12] Luo, Xin, et al. "An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems." IEEE Transactions on Industrial Informatics 10.2 (2014): 1273-1284.
 [13] Ning, Xia, and George Karypis. "SLIM: Sparse linear methods for top-N recommender systems." 2011 11th IEEE International Conference on Data Mining. IEEE, 2011.
 [14] Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
 [15] Sarwar, Badrul, et al. "Item-based collaborative filtering recommendation algorithms." Proceedings of the 10th International Conference on World Wide Web. ACM, 2001.
 [16] Karypis, George. "Evaluation of item-based top-N recommendation algorithms." Proceedings of the Tenth International Conference on Information and Knowledge Management. ACM, 2001.
 [17] Sarwar, Badrul, et al. "Application of dimensionality reduction in recommender system - a case study." No. TR-00-043. Minnesota Univ Minneapolis Dept of Computer Science, 2000.
 [18] Schafer, J. Ben, Joseph A. Konstan, and John Riedl. "E-commerce recommendation applications." Data Mining and Knowledge Discovery 5.1-2 (2001): 115-153.
 [19] Smyth, Barry. "Case-based recommendation." The Adaptive Web. Springer, Berlin, Heidelberg, 2007. 342-376.
 [20] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791.
 [21] Lee, Daniel D., and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." Advances in Neural Information Processing Systems. 2001.
 [22] Steck, Harald. "Training and testing of recommender systems on data missing not at random." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
 [23] Adomavicius, Gediminas, and YoungOk Kwon. "Improving aggregate recommendation diversity using ranking-based techniques." IEEE Transactions on Knowledge and Data Engineering 24.5 (2012): 896-911.
 [24] Su, Xiaoyuan, and Taghi M. Khoshgoftaar. "A survey of collaborative filtering techniques." Advances in Artificial Intelligence 2009 (2009).
 [25] Tibshirani, Robert, Guenther Walther, and Trevor Hastie. "Estimating the number of clusters in a data set via the gap statistic." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001): 411-423.
 [26] Tolic, Dijana, Nino Antulov-Fantulin, and Ivica Kopriva. "A nonlinear orthogonal non-negative matrix factorization approach to subspace clustering." Pattern Recognition 82 (2018): 40-55.
 [27] Trigeorgis, George, et al. "A deep matrix factorization method for learning attribute representations." IEEE Transactions on Pattern Analysis and Machine Intelligence 39.3 (2017): 417-429.
 [28] Wang, Hao, Naiyan Wang, and Dit-Yan Yeung. "Collaborative deep learning for recommender systems." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
 [29] Xue, Hong-Jian, et al. "Deep matrix factorization models for recommender systems." IJCAI. 2017.
 [30] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2006.