Learning users' preferences and making recommendations for them is of great importance in e-commerce, targeted advertising, and web search. Recommendation is rapidly becoming one of the most successful applications of data mining and machine learning. The goal of a Top-N recommendation algorithm is to produce a length-N list of recommended items such as movies, music, and so on. Over the years, a number of algorithms have been developed to tackle the Top-N recommendation problem. They make predictions based on user feedback, for example, purchases, ratings, reviews, clicks, and check-ins. The existing methods can be broadly classified into two classes: content-based filtering and collaborative filtering (CF).
Content-based filtering: in this approach, features or descriptions are utilized to describe the items, and a user profile or model is built from the user's past item ratings to summarize the types of items this user likes. This approach is based on the underlying assumption that liking a feature in the past leads to liking the feature in the future. It has two main disadvantages: if the content does not contain enough information to discriminate between the items, the recommendation will not be accurate; and when there is not enough information to build a solid model for a new user, the recommendation quality will also suffer.
Collaborative filtering: in this approach, user/item co-rating information is utilized to build models. Specifically, CF relies on the following assumption: if a user A likes some items that are also liked by another user B, A is likely to share B's preference on another item. One challenge for CF algorithms is dealing with highly sparse data, since users typically rate only a small portion of the available items.
In general, CF methods can be further divided into two categories: nearest-neighbor-based methods and model-based methods. Methods in the first class compute the similarities between users/items using the co-rating information, and new items are recommended based on these similarities. One representative method of this kind is item-based k-nearest-neighbor (ItemKNN). Model-based methods, on the other hand, employ a machine learning algorithm to build a model, which is then used to perform the recommendation task. This model learns the similarities between items or the latent factors that explain ratings. For example, the matrix factorization (MF) method uncovers a low-rank latent structure of the data, approximating the user-item matrix as a product of two factor matrices.
Matrix factorization is popular for collaborative prediction, and many works are based on it. For instance, the pure singular-value-decomposition-based (PureSVD) MF method represents users and items by the most principal singular vectors of the user-item matrix; the weighted regularized matrix factorization (WRMF) method deploys a weighting matrix to discriminate between the contributions from observed purchase/rating activities and unobserved ones.
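As a rough illustration of the low-rank idea behind MF methods such as PureSVD, the following sketch scores items with a truncated SVD of a small user-item matrix. The toy matrix, the function name `truncated_svd_scores`, and the choice of `k` are assumptions made for this example and are not settings from the paper.

```python
import numpy as np

def truncated_svd_scores(A, k):
    """Return the best rank-k approximation of the user-item matrix A (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k leading singular triplets.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Toy user-item matrix with two groups of users and items (illustrative data only).
A = np.array([[5.0, 4.0, 0.0, 0.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 4.0],
              [0.0, 0.0, 4.0, 3.0]])

scores = truncated_svd_scores(A, k=2)
print(np.round(scores, 2))
```

Entries of `scores` that correspond to unrated items can then be ranked per user to form a recommendation list.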
Recently, a novel Top-N recommendation method called LorSLIM has been developed, which has been shown to achieve good performance on a wide variety of datasets and to outperform other state-of-the-art approaches. LorSLIM improves upon the traditional item-based nearest neighbor CF approaches by learning, directly from the data, a sparse and low-rank matrix of aggregation coefficients that are analogous to the traditional item-item similarities. It demonstrates that a low-rank requirement on the similarity matrix is crucial to improving recommendation quality. Since the rank function can hardly be used directly, the nuclear norm is adopted in LorSLIM as a convex relaxation of the matrix rank function. Although the nuclear norm indeed recovers low-rank matrices in some scenarios, some recent work has pointed out that this relaxation may lead to poor solutions. In this paper, we propose a novel relaxation which provides a better approximation to the rank function than the nuclear norm. By using this new approximation in the LorSLIM model, we observe significant improvement over the current methods. The main contributions of our paper are as follows:
- We introduce a novel matrix rank approximation function, whose value can be very close to the true rank. It can be applied to a range of rank minimization problems in machine learning and computer vision.
- We design an efficient optimization strategy for the associated nonconvex optimization problem, in which every subproblem admits a closed-form solution.
- As an illustration, we perform experiments on six real datasets. The results indicate that our Top-N recommendation approach considerably outperforms the state-of-the-art algorithms, which give similar performance to one another on most datasets. Thus, this fundamental enhancement is attributed to our better rank approximation.
The remainder of this paper is organized as follows. In Section 2, we give some notations. Section 3 describes related work. Section 4 introduces the proposed model. In Section 5, we describe our experimental framework. Experimental results and analysis are presented in Section 6; Section 7 draws conclusions.
2 Notations and Definitions
Let $\mathcal{U}$ and $\mathcal{T}$ represent the sets of all users and all items, respectively, with $m$ users and $n$ items. The entire set of user-item purchases/ratings is represented by the user-item matrix $A$ of size $m \times n$. The value of $a_{ij}$ is 1 or a positive value if user $u_i$ has ever purchased/rated item $t_j$; otherwise it is 0. $\mathbf{a}_i^T$, the $i$-th row of $A$, denotes the purchase/rating history of user $u_i$ on all items. The $j$-th column of $A$, denoted $\mathbf{a}_j$, is the purchase/rating history of all users on item $t_j$. The aggregation coefficient matrix is represented as $W$ of size $n \times n$, and $\mathbf{w}_j$ is a size-$n$ column vector of aggregation coefficients. $\|W\|_1 = \sum_i \sum_j |w_{ij}|$ is the $\ell_1$-norm of $W$, and $\|W\|_F^2 = \sum_i \sum_j w_{ij}^2$ denotes the squared Frobenius norm of $W$. The nuclear norm of $W$ is $\|W\|_* = \sum_{i=1}^{n} \sigma_i(W)$, where $\sigma_i(W)$ is the $i$-th singular value of $W$. The unit step function $s(x)$ has value 1 for $x > 0$ and 0 if $x = 0$. The rank of matrix $W$ is $\mathrm{rank}(W) = \sum_{i=1}^{n} s(\sigma_i(W))$. We use $\sigma(W)$ to denote the vector of all singular values of $W$ in non-increasing order. Moreover, $I$ denotes the identity matrix.

In this paper, we denote all vectors (e.g., $\mathbf{a}_i$, $\mathbf{w}_j$) with bold lower case letters. We represent all matrices (e.g., $A$, $W$) with upper case letters. A predicted value is represented by having a tilde ($\sim$) mark.
3 Relevant Research
Recently, an interesting Top-N recommendation method, sparse linear methods (SLIM), has been proposed, which generates recommendation lists by learning a sparse similarity matrix. SLIM solves the following regularized optimization problem:

$$\min_{W}\ \frac{1}{2}\|A - AW\|_F^2 + \frac{\beta}{2}\|W\|_F^2 + \lambda\|W\|_1 \quad \text{s.t.}\quad W \ge 0,\ \mathrm{diag}(W) = 0,$$

where $A$ is the user-item matrix and $W$ is the aggregation coefficient matrix. The first term measures the reconstruction error, $\|W\|_1$ enforces sparsity on $W$, and the second and third terms combine the sparsity-inducing property of $\|W\|_1$ with the smoothness of $\|W\|_F^2$, in a way similar to the elastic net. The first constraint is intended to ensure that the learned coefficients represent positive similarities between items, while the second constraint is applied to avoid the trivial solution in which $W$ is an identity matrix, i.e., an item always recommends itself. It has been shown that SLIM outperforms other Top-N recommendation methods. A drawback of SLIM is that it can only model relations between items that have been co-purchased/co-rated by at least one user. Therefore, it fails to capture the potential dependencies between items that have not been co-rated by at least one user, while modeling relations between items that are not co-rated is essential for good performance of item-based approaches on sparse datasets.
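To make the roles of the three terms and the two constraints concrete, the sketch below evaluates a SLIM-style objective for a given coefficient matrix and builds a Top-N list from the predicted scores (user-item matrix times coefficient matrix). It only evaluates the objective and does not solve the optimization problem. The function names, the parameter names `beta` and `lam`, the toy data, and the masking of already purchased items are assumptions for illustration.

```python
import numpy as np

def slim_objective(A, W, beta, lam):
    """Reconstruction error + Frobenius (smoothness) term + l1 (sparsity) term."""
    recon = 0.5 * np.linalg.norm(A - A @ W, "fro") ** 2
    ridge = 0.5 * beta * np.linalg.norm(W, "fro") ** 2
    sparsity = lam * np.abs(W).sum()
    return recon + ridge + sparsity

def is_feasible(W):
    """Check the two constraints: non-negative entries and a zero diagonal."""
    return bool(np.all(W >= 0) and np.all(np.diag(W) == 0))

def top_n(A, W, n):
    """Rank items per user by predicted score, skipping items the user already has."""
    scores = A @ W
    scores = np.where(A > 0, -np.inf, scores)
    return np.argsort(-scores, axis=1)[:, :n]

# Toy data (illustrative only): 2 users, 3 items.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

print(slim_objective(A, W, beta=1.0, lam=0.1), is_feasible(W))
print(top_n(A, W, n=1))
```

The identity matrix fails `is_feasible`, which mirrors the purpose of the zero-diagonal constraint described above.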
To address the above issue, LorSLIM further considers the low-rank structure of the coefficient matrix. This idea is inspired by the factor model, which assumes that a few latent variables are responsible for items' features, so that the coefficient matrix factors into low-dimensional components and is of low rank. Together with sparsity, this yields a block-diagonal coefficient matrix, i.e., the items are grouped into many smaller "clusters" or categories. Such structure occurs frequently in real life, for example with movies, music, and books. Therefore, this model further improves the recommendation precision.
In LorSLIM, the nuclear norm is utilized as a surrogate for the rank of the coefficient matrix. Comparing the nuclear norm, which sums the singular values, with the rank, which counts the nonzero singular values, we can see that when the singular values are much larger than 1, the nuclear norm approximation deviates markedly from the true rank. The nuclear norm is essentially an $\ell_1$-norm of the singular values, and it is well known that the $\ell_1$-norm has a shrinkage effect and leads to a biased estimator. Recently, some variations of the nuclear norm have been studied: in truncated nuclear norm regularization, some of the largest singular values are subtracted from the nuclear norm; in the singular value thresholding algorithm, a soft thresholding rule is applied to all singular values; and some generalized nonconvex rank approximations have also been investigated. These models show good performance in some applications; however, they are either overly simple or restricted to specific applications.
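The gap between the rank and the nuclear norm can be seen on a small example. The sketch below builds a rank-one matrix whose single nonzero singular value is much larger than 1 and compares the two quantities. The helper name `rank_and_nuclear_norm`, the tolerance, and the toy matrix are assumptions for this illustration.

```python
import numpy as np

def rank_and_nuclear_norm(W, tol=1e-10):
    """Return (number of singular values above tol, sum of singular values)."""
    s = np.linalg.svd(W, compute_uv=False)
    return int((s > tol).sum()), float(s.sum())

# Rank-one matrix u u^T with ||u||^2 = 9, so its only nonzero singular value is 9.
u = np.array([[1.0], [2.0], [2.0]])
W = u @ u.T

rank, nuclear = rank_and_nuclear_norm(W)
print(rank, nuclear)

# Scaling the matrix leaves the rank unchanged but scales the nuclear norm.
print(rank_and_nuclear_norm(10.0 * W))
```

The rank stays at one however large the entries become, while the nuclear norm grows with them, which is the imbalance that motivates a tighter rank surrogate.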
In this paper, we develop a more general approach, which directly approximates the rank function with our formulation and optimization. Then we show that better rank approximation can improve the recommendation accuracy substantially.
4 Proposed Framework
4.1 Problem Setup
In this paper, we propose the following continuous function to replace the unit step function in the definition of the rank function:
There are several motivations behind this formulation. First, it attenuates the contributions from large singular values significantly, thus overcoming the imbalanced penalization of different singular values. Second, by defining , is differentiable and concave in . Third, is unitarily invariant. The last two properties greatly facilitate the subsequent optimization and computation. Compared to many other approaches, this formulation enjoys simplicity and efficacy.
Since (3) is a nonconvex problem, it is hard to solve directly. We introduce auxiliary variables to make the objective function separable and solve the following equivalent problem:
This can be solved by using the augmented Lagrange multiplier (ALM) method . We turn to minimizing the following augmented Lagrangian function:
where is the penalty parameter and , , are the Lagrange multipliers. This unconstrained problem can be minimized with respect to , , and alternatively, by fixing the other variables, and then updating the Lagrange multipliers , , and . At the th iteration,
We can see that the objective function of (5) is quadratic and strongly convex in , which has a closed-form solution:
For minimization, we have
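which can be solved by the following lemma. For $\tau > 0$ and a matrix $Y$, the solution of the problem

$$\min_{X}\ \tau\|X\|_1 + \frac{1}{2}\|X - Y\|_F^2$$

is given by $S_\tau(Y)$, which is defined component-wise by

$$[S_\tau(Y)]_{ij} = \max\{|y_{ij}| - \tau,\ 0\}\cdot \mathrm{sign}(y_{ij}).$$

Therefore, by letting , we can solve element-wise as below: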
To update , we have
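This can be solved with the following theorem. If $F$ is a unitarily invariant function with $F(X) = f(\sigma(X))$, $\mu > 0$, and $Y$ is a matrix whose SVD is $U \Sigma_Y V^T$ with $\Sigma_Y = \mathrm{diag}(\sigma_Y)$, then the optimal solution to the following problem

$$\min_{X}\ F(X) + \frac{\mu}{2}\|X - Y\|_F^2$$

is $X^*$ with SVD being $U \Sigma_X^* V^T$, where $\Sigma_X^* = \mathrm{diag}(\sigma^*)$ is obtained through the Moreau-Yosida operator $\sigma^* = \mathrm{prox}_{f,\mu}(\sigma_Y)$, defined as

$$\mathrm{prox}_{f,\mu}(\sigma_Y) := \arg\min_{\sigma \ge 0}\ f(\sigma) + \frac{\mu}{2}\|\sigma - \sigma_Y\|_2^2.$$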
In our case, the first term in (11) is concave while the second term is convex in , so we can resort to the difference of convex (DC) optimization strategy. A linear approximation is applied at each iteration of DC programming. For this inner loop, at the th iteration,
where is the gradient of at and is the SVD of . Finally, it converges to a local optimal point . Then .
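The two closed-form tools used in this derivation can be sketched in a few lines. The element-wise rule is the standard soft-thresholding operator for an $\ell_1$-penalized least-squares subproblem, and the second function shows how a problem with a unitarily invariant penalty reduces to a problem on singular values. As a stand-in for the proposed rank approximation, whose inner DC loop is not reproduced here, the example uses the nuclear norm, for which the singular-value step is again soft-thresholding. All function and variable names are assumptions.

```python
import numpy as np

def soft_threshold(Y, tau):
    """Element-wise minimizer of tau*||X||_1 + 0.5*||X - Y||_F^2."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def prox_on_singular_values(Y, scalar_prox):
    """Apply a proximal step to the singular values of Y, keeping its singular vectors."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * scalar_prox(s)) @ Vt

# l1 subproblem: entries smaller than tau in magnitude are set to zero.
Y = np.array([[3.0, -0.2],
              [0.4, -1.5]])
X = soft_threshold(Y, 0.5)

# Nuclear-norm stand-in: shrink each singular value by 0.5 and clip at zero.
D = np.diag([3.0, 1.0, 0.2])
Z = prox_on_singular_values(D, lambda s: soft_threshold(s, 0.5))
print(X)
print(np.round(Z, 6))
```

Replacing the lambda with a different per-singular-value update would give the corresponding step for another unitarily invariant penalty, provided that update is the minimizer of the associated scalar problem.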
To update , we need to solve
which yields the updating rule
Here max is an element-wise operator. The complete procedure is outlined in Algorithm 1.
Input: Original data matrix , parameters , , , .
Initialize: as -by- matrices with random numbers between 0 and 1, .
UNTIL stopping criterion is met.
5 Experimental Evaluation
Table 1: The “#users”, “#items”, and “#trns” columns show the number of users, number of items, and number of transactions, respectively, in each dataset. The “rsize” and “csize” columns show the average number of ratings of each user and of each item, respectively, in each dataset. The “density” column shows the density of each dataset (i.e., density = #trns/(#users × #items)). The “ratings” column is the rating range of each dataset with granularity 1.
We evaluate the performance of our method on six different real datasets whose characteristics are summarized in Table 1. These datasets represent different applications of a recommendation algorithm. They can be broadly categorized into two classes.
The first class contains Delicious, lastfm and BX. These three datasets have only implicit feedback, i.e., they are represented by binary matrices. Specifically, Delicious contains the bookmarking and tagging information of 2 users in the Delicious social bookmarking system (http://www.delicious.com), in which each URL was bookmarked by at least 3 users. Lastfm represents music artist listening information extracted from the last.fm online music system (http://www.last.fm), in which each music artist was listened to by at least 10 users and each user listened to at least 5 artists. BX is a part of the Book-Crossing dataset (http://www.informatik.uni-freiburg.de/~cziegler/BX/) such that only implicit interactions are contained and each book was read by at least 10 users.
The second class contains ML100K, Netflix and Yahoo. All these datasets contain multi-value ratings. Specifically, the ML100K dataset contains movie ratings and is a subset of the MovieLens research project (http://grouplens.org/datasets/movielens/). The Netflix dataset is a subset of the Netflix Prize dataset (http://www.netflixprize.com/), in which each user rated at least 10 movies. The Yahoo dataset is a subset obtained from Yahoo! Movies user ratings (http://webscope.sandbox.yahoo.com/catalog.php?datatype=r). In this dataset, each user rated at least 5 movies and each movie was rated by at least 3 users.
Table 2: The parameters for each method are described as follows: ItemKNN: the number of neighbors; PureSVD: the number of singular values and the number of SVD; WRMF: the dimension of the latent space and its weight on purchases; BPRKNN: its learning rate and regularization parameter; BPRMF: the latent space's dimension and learning rate; SLIM: the $\ell_2$-norm regularization parameter and the $\ell_1$-norm regularization coefficient; LorSLIM: the $\ell_2$-norm regularization parameter, the $\ell_1$-norm regularization parameter, the nuclear norm regularization coefficient, and the auxiliary parameter; Ours: the $\ell_1$-norm regularization parameter, the rank regularization parameter, and the auxiliary parameter. N in this table is 10. Bold numbers are the best performance in terms of HR and ARHR for each dataset.
5.2 Evaluation Methodology
To examine the effectiveness of the proposed method, we follow the procedure used in prior work and adopt 5-fold cross validation. For each fold, a dataset is split into training and test sets by randomly selecting one non-zero entry for each user and putting it in the test set, while using the rest of the data for training the model. (We use the same data as in that prior work, with partitioned datasets kindly provided by its first author.) Then a ranked list of N items is produced for each user. We then evaluate the model by comparing the ranked list of recommended items with the item in the test set. In the results presented in this paper, N is equal to 10 by default.
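One fold of the split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `leave_one_out_split`, the use of `-1` for users without feedback, the toy matrix, and the seed are all hypothetical, and the paper's experiments use pre-partitioned datasets rather than this code.

```python
import numpy as np

def leave_one_out_split(A, rng):
    """Hold out one randomly chosen non-zero entry per user (row) for testing."""
    train = A.copy()
    test_items = np.full(A.shape[0], -1, dtype=int)  # -1 marks users with no feedback
    for user in range(A.shape[0]):
        rated = np.flatnonzero(A[user])
        if rated.size == 0:
            continue
        held_out = rng.choice(rated)
        test_items[user] = held_out
        train[user, held_out] = 0  # remove the held-out entry from the training data
    return train, test_items

# Toy user-item matrix (illustrative only).
A = np.array([[5.0, 0.0, 3.0, 0.0],
              [0.0, 4.0, 0.0, 1.0],
              [2.0, 2.0, 0.0, 0.0]])

train, test_items = leave_one_out_split(A, np.random.default_rng(0))
print(test_items)
```

Repeating the call with different random generators would give the different folds.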
The recommendation quality is evaluated by the Hit Rate (HR) and the Average Reciprocal Hit Rank (ARHR). HR is defined as

$$\mathrm{HR} = \frac{\#\text{hits}}{\#\text{users}},$$

where #hits is the number of users whose item in the test set is contained (i.e., hit) in the size-N recommendation list, and #users is the total number of users. An HR value of 1.0 means that the algorithm is able to always recommend hidden items correctly, whereas an HR value of 0.0 indicates that the algorithm is not able to recommend any of the hidden items.

A drawback of HR is that it treats all hits equally without considering where they appear in the Top-N list. ARHR addresses this by rewarding each hit based on its place in the Top-N list, and is defined as

$$\mathrm{ARHR} = \frac{1}{\#\text{users}} \sum_{i=1}^{\#\text{hits}} \frac{1}{p_i},$$

where $p_i$ is the position of the item in the ranked Top-N list for the $i$-th hit. In this metric, hits that occur earlier in the ranked list are weighted higher than those that occur later, and thus ARHR indicates how strongly an item is recommended. The highest value of ARHR is equal to HR, which occurs when all the hits occur in the first position, and the lowest value is equal to HR/N, when all the hits occur in the last position of the list.
HR and ARHR are recommended as evaluation metrics since they directly measure the performance based on the ground truth data, i.e., the feedback that users have already provided.
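The two metrics can be computed directly from the per-user ranked lists and the held-out items. The sketch below is a minimal illustration; the function name `hit_rate_and_arhr` and the toy lists are assumptions, and positions are counted from 1 as in the definitions above.

```python
def hit_rate_and_arhr(ranked_lists, test_items):
    """Return (HR, ARHR) given one ranked Top-N list and one held-out item per user."""
    n_users = len(test_items)
    hits = 0
    reciprocal_sum = 0.0
    for ranked, held_out in zip(ranked_lists, test_items):
        ranked = list(ranked)
        if held_out in ranked:
            hits += 1
            # Position in the list, counted from 1.
            reciprocal_sum += 1.0 / (ranked.index(held_out) + 1)
    return hits / n_users, reciprocal_sum / n_users

# Toy example with N = 3 and three users (illustrative only).
ranked_lists = [[3, 1, 2], [0, 2, 4], [5, 6, 7]]
test_items = [1, 0, 9]  # hit at position 2, hit at position 1, miss

hr, arhr = hit_rate_and_arhr(ranked_lists, test_items)
print(hr, arhr)
```

Here two of the three users are hit, and the reciprocal positions 1/2 and 1 are averaged over all three users.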
5.3 Comparison Algorithms
We compare the performance of the proposed method with seven state-of-the-art Top-N recommendation algorithms, including the item neighborhood-based collaborative filtering method ItemKNN, two MF-based methods, PureSVD and WRMF, as well as SLIM and LorSLIM. We also examine two ranking/retrieval-criteria-based methods, BPRMF and BPRKNN, which use the Bayesian personalized ranking (BPR) criterion to measure the difference between the rankings of user-purchased items and the remaining items.
6.1 Top-N Recommendation Performance
We summarize the experimental results of different methods in Table 2. It shows that our algorithm performs the best among all methods across all the datasets (code for our algorithm can be found at https://github.com/sckangz/SDM16). Specifically, in terms of HR, our method outperforms ItemKNN, PureSVD, WRMF, BPRKNN, BPRMF, SLIM and LorSLIM by 40.41%, 47.22%, 34.65%, 27.99%, 36.01%, 25.67%, 11.66% on average, respectively, over all the six datasets; with respect to ARHR, the average improvements across all the datasets for ItemKNN, PureSVD, WRMF, BPRKNN, BPRMF, SLIM and LorSLIM are 45.79%, 56.38%, 45.43%, 34.25%, 46.71%, 29.41%, 11.23%, respectively. This suggests that a closer rank approximation than the nuclear norm is indeed crucial in real applications.
Among the seven other algorithms, LorSLIM is a little better than the others. SLIM, BPRMF, and BPRKNN give similar performance. Of the three MF-based methods, BPRMF and WRMF are better than PureSVD except on lastfm and ML100K. It is interesting to note that the simple ItemKNN performs better than BPRMF on Netflix and Yahoo. This could be because in BPRMF, the entire AUC curve is used to measure whether the items of interest are ranked higher than the rest. However, a good AUC value may not lead to good performance for Top-N recommendation.
6.2 Recommendation for Different Top-N
We show the performance of these algorithms for different values of N (i.e., 5, 10, 15, 20 and 25) on all six datasets in Figure 1. It shows that our algorithm outperforms other methods significantly in all cases. Once again, it demonstrates the importance of good rank approximation.
6.3 Matrix Reconstruction
We use ML100K to show how LorSLIM and our method reconstruct the user-item matrix. The density of ML100K is 6.30% and the mean of its non-zero elements is 3.53. The reconstructed matrix from LorSLIM has a density of 13.61%, and its non-zero values have a mean of 0.046. Of the 6.30% non-zero entries in the original matrix, the LorSLIM reconstruction recovers 70.68%, with a mean value of 0.0665. In contrast, our proposed algorithm recovers all zero values. The mean of our reconstructed matrix is 0.236. For the 6.30% non-zero entries in the original matrix, it gives a mean of 1.338. These facts suggest that our method recovers the original matrix better than LorSLIM does. In other words, LorSLIM loses too much information. This appears to explain the superior performance of our proposed method.
6.4 Parameter Effects
Our model involves two regularization parameters, one for the $\ell_1$-norm and one for the rank approximation. We also introduce an auxiliary parameter in the ALM algorithm. Some previous studies have pointed out that a dynamic value of this auxiliary parameter is preferred in practice. Hence we increase it at a rate of 1.1, which is a popular choice in the literature. For each possible combination of the two regularization parameters, we can use grid search to find the optimal initial value of the auxiliary parameter.
In Figure 2, we depict the effects of different values of the two regularization parameters on the ML100K dataset. As can be seen, our algorithm performs well over a large range of both parameters. Compared to the rank regularization parameter, the result is more sensitive to the $\ell_1$-norm regularization parameter. The performance keeps increasing as this parameter increases while it is small, and then decreases as it becomes larger. This is because the $\ell_1$-norm parameter controls the sparsity of the aggregation coefficient matrix. If it is too large, the matrix will be so sparse that nearly no item will be recommended, since the coefficients associated with the target item are all zero.
Another important parameter is the one in our rank approximation that controls how close our rank relaxation is to the true rank. Generally speaking, it is always safe to choose a small value, although the value can be larger if the singular values or the matrix size are large. If it is too small, it may cause numerical issues. Figure 3 displays the influence of this parameter on the rank approximation. It can be seen that can match the rank function closely when . For our previous experimental results, is applied, which results in an approximation error of .
7 Conclusion
In this paper, we propose a novel rank relaxation to solve the Top-N recommendation problem. This approximation addresses the limitations of the nuclear norm by mimicking the behavior of the true rank function. We show empirically that this nonconvex rank approximation can substantially improve the quality of Top-N recommendation. This surrogate for the rank function of a matrix may also benefit a number of other problems, such as robust PCA and robust subspace clustering.
This work is supported by the U.S. National Science Foundation under Grant IIS 1218712.
-  F. Ricci, L. Rokach, and B. Shapira, Introduction to recommender systems handbook. Springer, 2011.
-  M. Balabanović and Y. Shoham, “Fab: content-based, collaborative recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.
-  Q. Gu, J. Zhou, and C. H. Ding, “Collaborative filtering: Weighted nonnegative matrix factorization incorporating user and item graphs.” in SDM. SIAM, 2010, pp. 199–210.
-  F. Wang, S. Ma, L. Yang, and T. Li, “Recommendation on item graphs,” in Data Mining, 2006. ICDM’06. Sixth International Conference on. IEEE, 2006, pp. 1119–1123.
-  S. Zhang, W. Wang, J. Ford, and F. Makedon, “Learning from incomplete ratings using non-negative matrix factorization.” in SDM, vol. 6. SIAM, 2006, pp. 548–552.
-  M. J. Pazzani and D. Billsus, “Content-based recommendation systems,” in The adaptive web. Springer, 2007, pp. 325–341.
-  C. Desrosiers and G. Karypis, “A comprehensive survey of neighborhood-based recommendation methods,” in Recommender systems handbook. Springer, 2011, pp. 107–144.
-  X. Ning and G. Karypis, “Slim: Sparse linear methods for top-n recommender systems,” in Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 2011, pp. 497–506.
-  M. Deshpande and G. Karypis, “Item-based top-n recommendation algorithms,” ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 143–177, 2004.
-  Z. Kang, C. Peng, and Q. Cheng, “Top-n recommender system via matrix completion,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  P. Cremonesi, Y. Koren, and R. Turrin, “Performance of recommender algorithms on top-n recommendation tasks,” in Proceedings of the fourth ACM conference on Recommender systems. ACM, 2010, pp. 39–46.
-  R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang, “One-class collaborative filtering,” in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE, 2008, pp. 502–511.
-  Y. Cheng, L. Yin, and Y. Yu, “Lorslim: Low rank sparse linear methods for top-n recommendations,” in Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 2014, pp. 90–99.
-  B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM review, vol. 52, no. 3, pp. 471–501, 2010.
-  E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational mathematics, vol. 9, no. 6, pp. 717–772, 2009.
-  X. Shi and P. S. Yu, “Limitations of matrix completion via trace norm minimization,” ACM SIGKDD Explorations Newsletter, vol. 12, no. 2, pp. 16–20, 2011.
-  Z. Kang, C. Peng, and Q. Cheng, “Robust pca via nonconvex rank approximation,” in Data Mining (ICDM), 2015 IEEE International Conference on, Nov 2015, pp. 211–220.
-  Z. Kang and Q. Cheng, “Robust subspace clustering via tighter rank approximation,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 393–401.
-  N. Srebro and R. R. Salakhutdinov, “Collaborative filtering in a non-uniform world: Learning with the weighted trace norm,” in Advances in Neural Information Processing Systems, 2010, pp. 2056–2064.
-  H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
-  J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
-  C.-H. Zhang, “Nearly unbiased variable selection under minimax concave penalty,” The Annals of Statistics, pp. 894–942, 2010.
-  Y. Hu, D. Zhang, J. Ye, X. Li, and X. He, “Fast and accurate matrix completion via truncated nuclear norm regularization,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 9, pp. 2117–2130, 2013.
-  J.-F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
-  C. Lu, J. Tang, S. Yan, and Z. Lin, “Generalized nonconvex nonsmooth low-rank minimization,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 4130–4137.
-  C. Lu, C. Zhu, C. Xu, S. Yan, and Z. Lin, “Generalized singular value thresholding,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
-  M. Malek-Mohammadi, M. Babaie-Zadeh, and M. Skoglund, “Iterative concave rank approximation for recovering low-rank matrices,” Signal Processing, IEEE Transactions on, vol. 62, no. 20, pp. 5213–5226, 2014.
-  Z. Kang, C. Peng, and Q. Cheng, “Robust subspace clustering via smoothed rank approximation,” SIGNAL PROCESSING LETTERS, IEEE, vol. 22, no. 11, pp. 2088–2092, Nov 2015.
-  C. Peng, Z. Kang, H. Li, and Q. Cheng, “Subspace clustering using log-determinant rank approximation,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 925–934.
-  D. P. Bertsekas, “Nonlinear programming,” 1999.
-  A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009.
-  Z. Kang, C. Peng, J. Cheng, and Q. Cheng, “Logdet rank minimization with application to subspace clustering,” Computational Intelligence and Neuroscience, vol. 2015, 2015.
-  R. Horst and N. V. Thoai, “Dc programming: overview,” Journal of Optimization Theory and Applications, vol. 103, no. 1, pp. 1–43, 1999.
-  Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE, 2008, pp. 263–272.
-  S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr: Bayesian personalized ranking from implicit feedback,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009, pp. 452–461.