Learning from Incomplete Ratings using Nonlinear Multi-layer Semi-Nonnegative Matrix Factorization

10/16/2017 ∙ by Vaibhav Krishna, et al. ∙ ETH Zurich

Recommender systems problems witness a growing interest in finding better learning algorithms for personalized information. Matrix factorization, which estimates a user's liking for an item by taking an inner product of the latent features of the user and the item, has been widely studied owing to its accuracy and scalability. However, the mapping between these latent features and the original features may contain rather complex nonlinear hierarchical information that classical linear matrix factorization cannot capture. In this paper, we propose a novel multi-layer nonlinear approach to a variant of nonnegative matrix factorization (NMF) to learn such factors from an incomplete ratings matrix. First, we construct a user-item matrix with explicit ratings; second, we learn latent factors for representations of users and items with the designed nonlinear multi-layer approach. Further, the architecture is built with different nonlinearities, using an adaptive gradient optimizer to better learn the latent factors in this space. We show that by doing so, our model learns low-dimensional representations that are better suited for recommender systems on several benchmark datasets.




I Introduction

Nowadays, recommender systems [18, 15, 2, 19] are so ubiquitous in information systems that their absence draws attention, not vice versa. Making personalized predictions for specific users, based on some functional dependency on the past interactions of all users and items, is known as the collaborative filtering [24, 23] approach. Within this approach, different matrix factorization (MF) techniques have proven to be quite accurate and scalable for many recommender scenarios [9, 10, 12]. Essentially, they map both users and items to a joint latent factor space of lower dimension and model the interaction as their inner product in the latent space.

Recently, various deep learning models [1] such as Restricted Boltzmann Machines [14], stacked auto-encoders [28, 11], deep neural networks [6] or deep matrix factorizations [29] were introduced to collaborative filtering. Two aspects of collaborative filtering were upgraded: (i) linear latent representations of users and items were replaced by deep representations, and (ii) the inner product was replaced by a non-linear function represented with deep neural networks. This has motivated us to study which of these two aspects, (i) non-linear latent representations or (ii) non-linear interaction functions, dominates the ability to learn incomplete ratings. In this paper, we focus on matrix factorization recommender models. Note that the representations of classical matrix factorizations such as non-negative matrix factorization (NMF) or singular value decomposition (SVD) are essentially the same as the ones learned by basic linear auto-encoders.


In this paper, we focus on the collaborative filtering task of learning explicit ratings. For recent advances in using deep models with auxiliary information and implicit feedback, see [8]. The main contributions of the paper are the following: (i) we propose a simple model of compositions of non-linear matrix factors for learning incomplete explicit ratings; (ii) we evaluate our approach against a variety of baselines, including both linear and non-linear methods; (iii) in the supervised rating prediction task, our simple linear combination of non-linear representations has lower prediction errors (RMSE) on hold-out datasets, indicating better generalization ability; (iv) in the unsupervised clustering task performed on the obtained representations, we demonstrate that our approach attains representation ability comparable to complex deep matrix factorization, as measured by a clustering performance metric (within-cluster sum of squares).

II Preliminaries

In this section, we formulate the problem setting. The user-item rating matrix with $n$ users and $m$ items is denoted by $R \in \mathbb{R}^{n \times m}$. The original representation of user $i$ is the $i$-th row of matrix $R$, i.e. $R_{i\cdot}$, while the original representation of item $j$ is the $j$-th column of matrix $R$, i.e. $R_{\cdot j}$. Let $U \in \mathbb{R}^{n \times k}$ denote the latent feature matrix of users, where $u_i$ denotes the latent feature vector of user $i$, i.e. the $i$-th row of matrix $U$. Similarly, let $V \in \mathbb{R}^{k \times m}$ denote the latent feature matrix of items, where $v_j$ denotes the latent feature vector of item $j$, i.e. the $j$-th column of $V$.

The collaborative filtering task is to estimate the unobserved rating for user $i$ and item $j$ as

$\hat{r}_{ij} = f(u_i, v_j \mid \Theta), \qquad (1)$

where $\Theta$ denotes the model parameters of the interaction function $f$.

The latent features and parameters are found by minimizing the objective function

$\min_{U, V, \Theta} \sum_{(i,j) \in T} \ell(r_{ij}, \hat{r}_{ij}) + \lambda\, \Omega(U, V, \Theta), \qquad (2)$

where $\ell$ is a point-wise loss function (for pair-wise learning see [3]), $\Omega$ is a regularizer (L2 or L1 norm [13]) and $T$ denotes the set of training instances over which learning is done (see [22] about the missing data assumptions).

The regularized matrix factorization [9, 12] approach models the interaction function as the linear combination $f(u_i, v_j) = u_i^\top v_j$, which represents the inner product in the latent space. The latent factors are linear representations, due to the absence of non-linearity in the linear algebra transformation $R \approx UV$. Here, the L2 function is usually used for both loss and regularization, while the set of observed ratings is used as the training set $T$.
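As a point of reference, this linear interaction can be sketched in a few lines of numpy (a toy illustration; the matrix names and shapes are our own choices, not from the paper):

```python
import numpy as np

# Toy regularized MF: predict r_ij as the inner product u_i . v_j
rng = np.random.default_rng(0)
n, m, k = 4, 5, 2
U = rng.normal(size=(n, k))   # user latent factors (rows u_i)
V = rng.normal(size=(k, m))   # item latent factors (columns v_j)

def predict_linear(U, V, i, j):
    """Linear MF interaction: inner product in the latent space."""
    return U[i, :] @ V[:, j]

# the full reconstruction R ~ U V contains all such inner products
R_hat = U @ V
assert np.isclose(R_hat[1, 3], predict_linear(U, V, 1, 3))
```

In training, only the observed entries of $R$ would contribute to the loss; the full product is formed here only to illustrate the model.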

Neural collaborative filtering [6] models the interaction function as a deep neural network $f(u_i, v_j) = \phi_{out}(\phi_X(\dots \phi_1(u_i, v_j)))$, where $\phi_{out}$ and $\phi_x$ are the mappings of the output layer and the $x$-th layer, of the form $\phi_x(z) = g(W_x z + b_x)$. Here $g$ represents a non-linear function such as the rectified linear unit, the hyperbolic tangent or others. The model parameters $\{W_x, b_x\}$ and the latent factors are learned in a joint manner, where the latent factors are non-linear representations of the user and item vectors from the rating matrix $R$.

Deep matrix factorizations [29] model the interaction function as an inner product $f(u_i, v_j) = p_i^\top q_j$. However, both user and item features are deep representations of the form $p_i = g(W_2\, g(W_1 R_{i\cdot}))$ and $q_j = g(W_4\, g(W_3 R_{\cdot j}))$.

III Proposed Model

In order to learn incomplete ratings, we formulate the following multilayer semi-NMF model (NSNMF): $R \approx UV + B$, where $R$ is the rating matrix, $B$ is the bias matrix, $U$ is the latent user preferences matrix and $V$ is the matrix of the non-linear item latent features. To model the non-linear representation of the items, we use $V = g(WH)$, which finally leads to:

$R \approx U\, g(WH) + B, \qquad (3)$

where $H \geq 0$ is the hidden latent representation of the items, $W$ is the weighting matrix between latent representations on different levels and $g(\cdot)$ is an element-wise non-linear function used to better approximate the non-linear manifolds of the latent features.
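A minimal forward pass of this model can be sketched as follows (the shapes and variable names are our own assumptions; only the hidden item matrix carries the non-negativity constraint):

```python
import numpy as np

def relu(x):
    """Element-wise non-linearity g."""
    return np.maximum(x, 0.0)

def nsnmf_predict(U, W, H, B):
    """Reconstruct the full rating matrix as R_hat = U g(W H) + B."""
    return U @ relu(W @ H) + B

# toy shapes: n users, m items, user/item factor dim k, hidden item dim l
n, m, k, l = 5, 7, 3, 4
rng = np.random.default_rng(42)
U = rng.normal(size=(n, k))      # user preferences, may be negative (semi-NMF)
W = rng.normal(size=(k, l))      # inter-layer weighting matrix
H = rng.uniform(size=(l, m))     # hidden item representation, H >= 0
B = np.zeros((n, m))             # bias matrix
R_hat = nsnmf_predict(U, W, H, B)
assert R_hat.shape == (n, m)
```

In practice the dense product $g(WH)$ need not be materialized; a single predicted rating only requires one column of $H$.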

Similar architecture to our proposed method without bias term is used in the task of learning deep representation of images [27], where the learning depends on dense multiplicative updates.

Simply, we model the interaction function as an inner product with offset, $\hat{r}_{ij} = u_i^\top v_j + b_{ij}$, where the user feature $u_i$ is the $i$-th row of matrix $U$ and the item non-linear representation $v_j$ is the $j$-th column of $V = g(WH)$. This model falls into the category of semi-NMF factorization models [4]. Semi-NMF is a variant of NMF which imposes non-negativity constraints only on the latent factors of the second layer, $H \geq 0$. This allows both positive and negative offsets from the bias term $B$. Usually, in clustering, $U$ represents cluster centroids and $H$ represents a soft membership indicator [4]. But, from a recommender perspective, matrix $U$ may be interpreted as linear regression coefficients, and the non-negativity constraints imposed on the latent item attributes $H$ allow part-based representations [20, 21].

Note that much of the observed variation in rating values is due to effects associated with either users or items, known as biases or intercepts, independent of any interactions. Thus, instead of explaining the full rating value by an interaction of the form $u_i^\top v_j$, the system can try to identify the portion of these values that individual user or item biases cannot explain. The bias involved in a rating can be computed as follows: $b_{ij} = \mu + b_i + b_j$, where $\mu$ is the overall average rating, $b_i$ is the observed deviation of user $i$ and $b_j$ is the observed deviation of item $j$.
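The bias terms above (a global mean plus per-user and per-item deviations) can be estimated directly from the observed entries; the sketch below is a plain, unregularized version (the function name and the boolean-mask convention are our own):

```python
import numpy as np

def rating_biases(R, mask):
    """Estimate the global mean and per-user / per-item deviations
    from the observed entries of R; mask[i, j] is True where r_ij exists."""
    mu = R[mask].mean()
    n, m = R.shape
    b_user = np.array([R[i, mask[i]].mean() - mu if mask[i].any() else 0.0
                       for i in range(n)])
    b_item = np.array([R[mask[:, j], j].mean() - mu if mask[:, j].any() else 0.0
                       for j in range(m)])
    return mu, b_user, b_item

# tiny example: three observed ratings 4, 2, 2 give mu = 8/3
R = np.array([[4.0, 0.0], [2.0, 2.0]])
mask = np.array([[True, False], [True, True]])
mu, b_user, b_item = rating_biases(R, mask)
```

Production systems typically shrink these deviations toward zero with a regularizer; the unshrunk means above are the simplest consistent choice.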

Note that in the general case we are able to compose more non-linearities with the relation $H = g(W_2 H_2)$, which generates the model $R \approx U\, g(W_1\, g(W_2 H_2)) + B$. However, more complex item latent representations were shown not to be useful; see the experiments section for more details.

III-A Learning Model Parameters

For a two-layer item feature structure, the model parameters are updated through an element-wise gradient descent approach, minimizing eq. (2) with the squared loss function and L2 regularization $\lambda(\|U\|_F^2 + \|W\|_F^2 + \|H\|_F^2)$.

The model parameters are randomly initialized uniformly in the range [0,1], and we perform iterative updates for each observed rating $r_{ij}$ as follows:

$u_i \leftarrow u_i + \eta \left( e_{ij}\, g(W h_j) - \lambda u_i \right),$

$W \leftarrow W + \eta \left( e_{ij}\, (u_i \circ g'(W h_j))\, h_j^\top - \lambda W \right)$ (update only if $g'(W h_j) \neq 0$),

$h_j \leftarrow h_j + \eta \left( e_{ij}\, W^\top (u_i \circ g'(W h_j)) - \lambda h_j \right)$ (update only if $h_j$ remains non-negative),

where $g'(\cdot)$ is the derivative of the activation function and $e_{ij} = r_{ij} - \hat{r}_{ij}$ is the error term. Note that we do not explicitly store the dense matrix $V = g(WH)$. The computational complexity for training a 2-layer item-feature NSNMF architecture is of order $O(t\,|T|\,k\,l)$, where $k, l$ are the dimensions of the two layers and $t$ is the number of iterations. The learning rate was configured with the AdaGrad method [5], performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well suited for dealing with sparse data, as in our case of incomplete ratings. Given that $k, l$ are constant and the number of observed ratings grows at most linearly with the number of users and items, the scalability of the proposed method depends linearly on the number of users and items.

IV Experimental evaluation

In this section, we report experimental results by evaluating our NSNMF approach and a variety of baselines in both supervised and unsupervised tasks.

IV-A Datasets

We use three real datasets: FilmTrust, MovieLens 100K (ML100K) and Amazon Music (AMusic).

Each dataset is split into training and testing sets. The training set is then used for 10-fold cross-validation for hyperparameter tuning.

IV-B Baselines and setup

We use baselines including both linear and nonlinear approaches as follows:

CF Neighborhood models are the most common approach to CF, with user-oriented and item-oriented variants [15, 16], referred to as User-User CF and Item-Item CF respectively.

SVD is applied in the collaborative filtering domain by factorizing the user-item rating matrix [17], updating only on the known ratings.

NMF Nonnegative matrix factorization (NMF) introduces the nonnegative constraint into the MF process [12, 30]; we also include a regularized NMF variant to avoid over-fitting.

RBM Restricted Boltzmann Machine (RBM) [14] is an undirected graphical model, which contains a layer of visible softmax units for item ratings and a layer of hidden binary units per user. Each hidden unit can then learn to model a significant dependency between the ratings of different movies.

DMF presents a deep structure learning architecture to learn deep low-dimensional representations for users and items respectively [29]. It uses both explicit ratings and implicit feedback to optimize a normalized cross-entropy loss function, predicting scaled ratings on the continuous scale [0,1].

As for our proposed NSNMF method, we evaluate it with different activation functions as follows: (i) NSNMF ReLU is the NSNMF model with the rectified linear unit $g(x) = \max(0, x)$ as the activation function, (ii) NSNMF SoftPlus uses the softplus $g(x) = \ln(1 + e^x)$ as the activation function and (iii) NSNMF ReLU_bias is the proposed model with the rectified linear unit activation function plus the bias term.

In the supervised task, since we focus on explicit ratings, the root mean square error is used to assess the rating prediction performance [7]: $\mathrm{RMSE} = \sqrt{\frac{1}{|T_{test}|}\sum_{(i,j) \in T_{test}} (r_{ij} - \hat{r}_{ij})^2}$. The lower the RMSE, the better the approach performs.

In the unsupervised task, we aim to inspect the difference in the representations obtained by our NSNMF and the baseline approaches. We choose an unsupervised clustering task on such representations and thus use the pooled within-cluster sum of squares around the cluster means (WCSS) [25]: $\mathrm{WCSS} = \sum_{r=1}^{K} \frac{1}{2 n_r} \sum_{x, x' \in C_r} d(x, x')^2$, where $n_r$ denotes the number of elements inside cluster $C_r$ and $d(x, x')$ is the Euclidean distance between instances $x$ and $x'$ within the same cluster.
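Both evaluation metrics are straightforward to compute; a minimal sketch follows (function names are our own). For WCSS we use the standard identity that the pairwise form above equals the sum of squared distances to each cluster's centroid:

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root mean square error over held-out ratings."""
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return np.sqrt(np.mean((r_true - r_pred) ** 2))

def wcss(X, labels):
    """Pooled within-cluster sum of squares around the cluster means.
    Equivalent to summing, per cluster, the squared Euclidean distances
    of its points to the cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total
```

For K-means clusterings, this quantity coincides with the inertia that most clustering libraries report.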

IV-C Supervised tasks

In the supervised task, we perform 10-fold cross-validation for each dataset to determine the dimensions of the hidden representation and the regularization parameter $\lambda$ for each approach. The dimensions of the hidden representation were determined from the cross-validation results for values in {4, 6, 8, 10, 15, 20}. The learning rate and regularizer value were varied in the range {0.1, 0.01, 0.001}.

The final models were trained with learning rate 0.01, regularization parameter 0.1 and factor values of 4, 6 and 8 for the FilmTrust, AMusic and MovieLens datasets respectively.

Then, we report the rate prediction RMSE errors in Table I.

In Table I, we observe that the NSNMF-based approaches, i.e. NSNMF ReLU, SoftPlus and ReLU_bias, outperformed the baselines across all datasets. In particular, NSNMF ReLU_bias performed the best, achieving the lowest RMSE on all three datasets. Meanwhile, DMF, which learns non-linear representations, has lower RMSE than the baselines based on linear transformations, i.e. User-User CF, Item-Item CF, SVD, NMF and regularized NMF, most of the time. We trained DMF [29] (https://github.com/RuidongZ/Deep_Matrix_Factorizatio_Models) using the normalized cross-entropy loss on both implicit and explicit ratings. The predicted ratings on the scale [0,1], when scaled back to the original scale [0, max(R)], where max(R) denotes the maximum score over all ratings, perform worse than the baselines when compared against the real ratings on the same scale with the RMSE measure. Thus, we use their DMF architecture trained with the squared loss function to predict unscaled ratings, which are then evaluated with RMSE.

Algorithm FilmTrust ML100K AMusic
User-User CF 0.963 1.005 1.011
Item-Item CF 0.822 1.001 0.934
SVD 1.006 1.018 2.024
NMF 0.845 0.954 1.001
Regularized NMF 0.840 0.937 0.975
RBM 0.918 1.008 1.104
DMF 0.821 0.948 0.946
NSNMF ReLU 0.816 0.904 0.889
NSNMF Softplus 0.804 0.896 0.871
NSNMF ReLU_bias 0.788 0.887 0.836

TABLE I: Test RMSE of NSNMF variants and baselines

Furthermore, we trained our model with different numbers of hidden layers to assess the prediction errors on all three datasets. We found that the 2-layer architecture better models the variation in the rating matrix, while deeper architectures even decrease the performance. Due to the page limit, we report results up to 3 layers in Table II.

Algorithm FilmTrust ML100K AMusic
ReLU 2-layer 0.816 0.904 0.889
ReLU 3-layer 0.842 0.938 0.932

TABLE II: Test RMSE of NSNMF w.r.t. different number of layers

IV-D Unsupervised task

In this part, we perform unsupervised K-means clustering to evaluate the item representations learned by the different approaches in their latent spaces. We ran each approach with hyperparameters set via cross-validation and then obtained the derived representations.

In Figure 1, we report the WCSS [25] of each approach w.r.t. the number of clusters. We observe that our NSNMF ReLU and DMF consistently yield lower WCSS than NMF. This suggests that the representations derived by non-linear matrix factorization have higher representation ability. The WCSS values of NSNMF ReLU and DMF are quite comparable, indicating that the non-linear transformation is the dominant part, while the way such representations are combined results in only a minor difference in the derived representation. Moreover, the simple linear combination of non-linear representations leads to better generalization ability in supervised prediction, as already demonstrated in Table I.

Fig. 1: The pooled within-cluster sum of squares around the cluster means (WCSS) of clustering on the MovieLens100k dataset (where 3d, 8d denote the dimension of the feature space)

V Discussion and Future Work

Most deep learning architectures for recommendation have been implemented on the dense implicit feedback rating matrix. In this paper, we focus on learning non-linear item representations for explicit feedback and leave the extension to implicit feedback for future work. We believe it will be interesting to see the performance of the proposed algorithm on implicit feedback, which would provide a better comparison with the deep learning algorithms [6, 28] that train only on implicit feedback. Thus, the current paper compares against the deep learning methods that use explicit feedback in their training algorithms [14, 29].

We find that simple linear regression over non-linear item representations is sufficient to outperform other deep learning methods that use explicit feedback in their training algorithms [14, 29]. It is important to stress that in our model the linear regression and the non-linear item representations are learned jointly via non-linear semi-nonnegative matrix factorization. The non-negativity constraint allows better interpretability of item features, e.g. a movie cannot have a negative number of certain actors or a negative indication of a certain genre. However, the semi-non-negativity constraint allows the regression coefficients to become negative, e.g. a negative relation to certain item features.

Furthermore, the linear interaction of non-linear item features provides better predictions than the combination of non-linear item and non-linear user features, as in the case of the Deep Matrix Factorization model [29].

VI Conclusions

We introduced a multilayer nonlinear semi-nonnegative matrix factorization method to learn from an incomplete rating matrix. The multilayer approach, which automatically learns the hierarchy of item attributes, together with the non-negativity constraint, helps in the better interpretation of these factors. Furthermore, we presented an algorithm for optimizing the factors of our architecture with different non-linearities. We evaluated our approach against a variety of matrix factorization and deep learning baselines using both supervised rating prediction and unsupervised clustering in the latent item space. The results offer the following insights: (i) the simple linear combination of non-linear representations realized in our proposed approach achieves better generalization ability, that is, lower prediction errors on hold-out datasets; (ii) in the unsupervised clustering task, the representations learned by our approach yield a clustering performance metric (within-cluster sum of squares) comparable to deep matrix factorization.


N.A.-F. and T.G. are grateful for financial support from the EU Horizon 2020 project SoBigData under grant agreement No. 654024. The authors acknowledge D. Tolic for useful directions regarding Deep Semi-NMF approach in the early stage of work.



  • [1] Bengio, Yoshua, Aaron Courville, and Pascal Vincent. ”Representation learning: A review and new perspectives.” IEEE transactions on pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
  • [2] Bobadilla, Jesús, et al. ”Recommender systems survey.” Knowledge-based systems 46 (2013): 109-132.
  • [3] Cao, Zhe, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. ”Learning to rank: from pairwise approach to listwise approach.” Proceedings of the 24th International Conference on Machine Learning (ICML ’07). ACM, 2007.

  • [4] Ding, Chris HQ, Tao Li, and Michael I. Jordan. ”Convex and semi-nonnegative matrix factorizations.” IEEE transactions on pattern analysis and machine intelligence 32.1 (2010): 45-55.
  • [5] Duchi, John, Elad Hazan, and Yoram Singer. ”Adaptive subgradient methods for online learning and stochastic optimization.” Journal of Machine Learning Research 12.Jul (2011): 2121-2159.

  • [6] He, Xiangnan, et al. ”Neural collaborative filtering.” Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  • [7] Herlocker, Jonathan L., et al. ”Evaluating collaborative filtering recommender systems.” ACM Transactions on Information Systems (TOIS) 22.1 (2004): 5-53.
  • [8] Karatzoglou, Alexandros, and Balázs Hidasi. ”Deep learning for recommender systems.” Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 2017.
  • [9] Koren, Yehuda, Robert Bell, and Chris Volinsky. ”Matrix factorization techniques for recommender systems.” Computer 8 (2009): 30-37.
  • [10] Koren, Yehuda, and Robert Bell. ”Advances in collaborative filtering.” Recommender Systems Handbook, eds. F. Ricci, L. Rokach, B. Shapira, P. B. Kantor. Springer, New York, NY, 2011. 145-186.
  • [11] Li, Sheng, Jaya Kawale, and Yun Fu. ”Deep collaborative filtering via marginalized denoising auto-encoder.” Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015.
  • [12] Luo, Xin, et al. ”An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems.” IEEE Transactions on Industrial Informatics 10.2 (2014): 1273-1284.
  • [13] Ning, Xia, and George Karypis. ”Slim: Sparse linear methods for top-n recommender systems.” 2011 11th IEEE International Conference on Data Mining. IEEE, 2011.
  • [14] Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. ”Restricted Boltzmann machines for collaborative filtering.” Proceedings of the 24th international conference on Machine learning. ACM, 2007.
  • [15] Sarwar, Badrul, et al. ”Item-based collaborative filtering recommendation algorithms.” Proceedings of the 10th international conference on World Wide Web. ACM, 2001.
  • [16] Karypis, George. ”Evaluation of item-based top-n recommendation algorithms.” Proceedings of the tenth international conference on Information and knowledge management. ACM, 2001.
  • [17] Sarwar, Badrul, et al. Application of dimensionality reduction in recommender system-a case study. No. TR-00-043. Minnesota Univ Minneapolis Dept of Computer Science, 2000.
  • [18] Schafer, J. Ben, Joseph A. Konstan, and John Riedl. ”E-commerce recommendation applications.” Data mining and knowledge discovery 5.1-2 (2001): 115-153.
  • [19] Smyth, Barry. ”Case-based recommendation.” The adaptive web. Springer, Berlin, Heidelberg, 2007. 342-376.
  • [20] Lee, Daniel D., and H. Sebastian Seung. ”Learning the parts of objects by non-negative matrix factorization.” Nature 401.6755 (1999): 788.
  • [21] Lee, Daniel D., and H. Sebastian Seung. ”Algorithms for non-negative matrix factorization.” Advances in neural information processing systems. 2001.
  • [22] Steck, Harald. ”Training and testing of recommender systems on data missing not at random.” Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
  • [23] Adomavicius, Gediminas, and YoungOk Kwon. ”Improving aggregate recommendation diversity using ranking-based techniques.” IEEE Transactions on Knowledge and Data Engineering 24.5 (2012): 896-911.
  • [24] Su, Xiaoyuan, and Taghi M. Khoshgoftaar. ”A survey of collaborative filtering techniques.” Advances in Artificial Intelligence 2009 (2009).

  • [25] Tibshirani, Robert, Guenther Walther, and Trevor Hastie. ”Estimating the number of clusters in a data set via the gap statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001): 411-423.
  • [26] Tolic, Dijana, Nino Antulov-Fantulin, and Ivica Kopriva. ”A nonlinear orthogonal non-negative matrix factorization approach to subspace clustering.” Pattern Recognition 82 (2018): 40-55.

  • [27] Trigeorgis, George, et al. ”A deep matrix factorization method for learning attribute representations.” IEEE transactions on pattern analysis and machine intelligence 39.3 (2017): 417-429.
  • [28] Wang, Hao, Naiyan Wang, and Dit-Yan Yeung. ”Collaborative deep learning for recommender systems.” Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
  • [29] Xue, Hong-Jian, et al. ”Deep Matrix Factorization Models for Recommender Systems.” IJCAI. 2017.
  • [30] Zhang, Sheng, et al. ”Learning from incomplete ratings using non-negative matrix factorization.” Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006.