
A Differentiable Ranking Metric Using Relaxed Sorting Operation for Top-K Recommender Systems

by   Hyunsung Lee, et al.

A recommender system generates personalized recommendations for a user by computing the preference score of items, sorting the items according to the score, and filtering the top-K items with high scores. While sorting and ranking items are integral to this recommendation procedure, it is nontrivial to incorporate them in end-to-end model training, since sorting is non-differentiable and hard to optimize with gradient-based updates. This incurs an inconsistency between the existing learning objectives and the ranking-based evaluation metrics of recommendation models. In this work, we present DRM (differentiable ranking metric), which mitigates this inconsistency and improves recommendation performance by employing a differentiable relaxation of ranking-based evaluation metrics. Via experiments with several real-world datasets, we demonstrate that joint learning of the DRM cost function on top of existing factor based recommendation models significantly improves the quality of recommendations, in comparison with other state-of-the-art recommendation methods.





1 Introduction

With the massive growth of online content, it has become common for online content providers to equip recommender systems (RS) that give personalized recommendations, in order to facilitate better user experiences and alleviate the dilemma of choice [27]. In general, recommender systems generate relevance scores of items and select the K items with the highest relevance scores; thus, sorting and ranking operations are involved in top-K recommendation tasks.

As pointed out in several research works [2, 34, 14], optimizing objectives that are not aware of the ranking nature of top-K recommendation tasks does not always maximize ranking-based objectives. Still, most recommendation models exploit other surrogate objectives, such as mean squared error and negative log likelihood, because incorporating the sorting operation into end-to-end model training is challenging. The difficulty mostly derives from the non-differentiability of the sorting operation.

Although there exist ranking oriented objectives, such as pairwise objectives like the Bayesian Personalized Ranking loss [24] and listwise objectives based on the Plackett-Luce distribution [32, 14], neither is fully consistent with the top-K recommendation task: pairwise objectives only consider the relative ranking of two items at a time, and listwise objectives consider the entire ranking of items.

To bridge the inconsistency between the learning objectives used to fit models and the top-K recommendation task, we present the differentiable ranking metric (DRM), a continuous relaxation of ranking-based evaluation metrics such as precision or normalized discounted cumulative gain (NDCG). By employing the differentiable relaxation of the sorting operator introduced in [9] for factor based RS models, the DRM objective expedites the optimization of the targeted ranking-based metrics.

We first reformulate ranking-based metrics in terms of permutation matrix arithmetic and then relax the nondifferentiable permutation matrix to a differentiable row-stochastic matrix. This reformulation allows us to represent nondifferentiable ranking metrics in the differentiable form of DRM, which can be used to fit recommendation models with gradient-based updates. DRM is inherently consistent with the ranking-based metrics commonly used for evaluating recommenders, such as Precision and Recall. Moreover, it can be readily incorporated into existing RS models without modifying their structures.

We adopt DRM for two state-of-the-art factor based models, WARP [31] and CML [12], to evaluate the effect of DRM on those existing models in comparison with several other recommendation models. Our experiments demonstrate that the DRM objective significantly improves the quality of top-K recommendations on several real-world datasets.

The main contributions of this paper are summarized as:

  • We propose the DRM (Differentiable Ranking Metric) objective, which alleviates the misalignment between training costs and validation metrics in top-K recommendation models.

  • Via experiments with real-world datasets, we show that our approach, joint learning that incorporates DRM into model optimization, outperforms several state-of-the-art RS methods.

2 Preliminaries

Given a set of users, a set of items, and a set of observed interactions between them, a recommender model aims to learn to predict the preference score of a user u for an item i. We use binary implicit feedback, where the feedback is 1 if user u has interacted with item i, and 0 otherwise. Note that in this work we only consider this binary feedback format, although our approach can be easily generalized to various implicit feedback settings. We use u to index a user, and i and j to index items; usually i denotes an item that user u has interacted with, and j an item that user u has not. We denote the set of items with which user u has interacted as I_u. We also represent the items that user u has interacted with in bag-of-words notation, as a binary column vector over all items. Similarly, we use a score vector to collect the predicted scores of all items for user u.

2.1 Cost Functions of Recommendation Models

Like other machine learning models, recommender models require an objective function to optimize. Objectives for recommender models are grouped into three categories: pointwise, pairwise, and listwise.

Pointwise objectives maximize the accuracy of individual predictions independently. Mean squared error and cross-entropy are commonly used pointwise objectives for training recommender models. It is known that pointwise objectives have a limitation in that high predictive accuracy does not always lead to high-quality recommendations [6].

Pairwise objectives gained popularity because they are more closely related to top-K recommendation tasks than pointwise objectives. They enable a recommendation model to learn users' preferences by viewing the problem as binary classification: predicting whether user u prefers item i to item j. As noted in [3], one of the main concerns with pairwise approaches is that they are formalized to minimize classification errors over item pairs, rather than errors in item rankings.

Listwise objectives minimize errors in the list of sorted items or item scores. They have been explored by a few prior works [28, 14, 32], yet they are not fully investigated in the recommender systems domain, because list operations such as permutation and sorting are hard to differentiate. One significant drawback of existing listwise objectives is that they treat all ranks with equal importance; however, items at higher ranks are the ones actually recommended, and are therefore more important to top-K recommendation.

Our objective overcomes the limitations of pairwise objectives and current listwise objectives, exploiting both the ranking nature of top-K personalized recommendation and the emphasis on items at the top ranks.

2.2 Ranking Metrics for Recommendation Models

In general, a recommendation model generates a recommendation for a user as a list of the items with the highest predicted scores for that user, excluding the items with which the user has already interacted. In practice, its performance should be validated with respect to its given objectives before being deployed in target services.

In our notation, we represent the list of items ordered by the preference scores with respect to user u, and denote the item at rank k within that list. In addition, we define the function Hit(u, k) that specifies whether the k-th highest scored item for user u in the recommendation list is in the validation dataset V_u that contains all the items interacted with by u; that is, Hit(u, k) equals 1 if the item at rank k is in V_u, and 0 otherwise, where the indicator function yields 1 if the statement is true, and 0 otherwise.

Among the evaluation metrics for top-K recommendation, Precision and Recall are two of the most widely used [1]. For each user, both metrics are based on how many items in the top-K recommendation are in the validation dataset.

The precision specifies the fraction of hit items among the K recommended items, while the recall specifies the fraction of hit items among the items in the validation dataset. Notice that neither metric distinguishes the relative ranking among the items in the recommendation list.

On the other hand, discounted cumulative gain (DCG) [15] and average precision (AP) take the relative ranking of items into account by weighting the impact of each hit according to its rank. Furthermore, NDCG@K specifies a normalized value of DCG@K, divided by the ideal discounted cumulative gain.

The truncated AP can be viewed as a weighted sum of Hit over the top K ranks, where the hit at each rank k is weighted by the precision at rank k.

Notice that all the metrics above share a common form: a weighted sum of Hit. Accordingly, we formulate these metrics in a unified way, conditioned on a weight function over ranks. For simplicity, we omit the arguments of these metrics without loss of generality.
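As a concrete illustration, the shared weighted-sum-of-Hit form can be sketched as follows. This is a minimal sketch with hypothetical item ids and validation sets, not data from the paper: Precision@K uses the constant weight 1/K, while DCG@K uses 1/log2(k+1).

```python
import numpy as np

def hit_vector(ranked_items, validation_items, K):
    # Hit(k) = 1 if the item at rank k is in the user's validation set
    return np.array([1.0 if ranked_items[k] in validation_items else 0.0
                     for k in range(K)])

def metric_at_K(ranked_items, validation_items, K, weight):
    # Generic form shared by the metrics above: sum_k weight(k) * Hit(k)
    ranks = np.arange(1, K + 1)
    return float(np.sum(weight(ranks) * hit_vector(ranked_items, validation_items, K)))

ranked = [3, 7, 4, 1, 5]   # hypothetical top-5 recommendation (item ids)
relevant = {3, 4, 9}       # hypothetical validation items

prec5 = metric_at_K(ranked, relevant, 5, lambda k: np.full_like(k, 1 / 5, dtype=float))
dcg5 = metric_at_K(ranked, relevant, 5, lambda k: 1.0 / np.log2(k + 1))
```

With hits at ranks 1 and 3, this yields Precision@5 = 0.4 and DCG@5 = 1.5; only the weight function changes between the two metrics.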


3 Proposed Method

In this section, we propose DRM. We begin by introducing the two building blocks of our method. The first part introduces matrix factorization with a weighted hinge loss. Next, we show how to represent listwise metrics in terms of vector arithmetic and then relax these metrics into differentiable ones that can be optimized by gradient descent. Finally, we describe how to fit the model.

3.1 Factor Based Recommenders with Hinge Loss

Factor based recommenders represent users and items in a latent vector space, and formulate the preference of user u for item i as a function of the corresponding user and item vectors in that d-dimensional space. The dot product is one common method for mapping a pair of user and item vectors to a preference score [24, 26, 13]. In [12], collaborative metric learning (CML) embeds users and items in a Euclidean metric space and defines its score function as the negative L2 distance between the two vectors.
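The two score functions can be sketched as follows; the example vectors are illustrative only, not from the paper:

```python
import numpy as np

def score_dot(u_vec, i_vec):
    # dot-product preference score
    return float(u_vec @ i_vec)

def score_neg_l2(u_vec, i_vec):
    # CML-style score: negative Euclidean (L2) distance
    return -float(np.linalg.norm(u_vec - i_vec))

u = np.array([1.0, 0.0])   # a user factor
i = np.array([0.0, 1.0])   # an item factor
```

Both map a (user, item) pair to a scalar; higher is better in either case, which is what lets the rest of the pipeline treat the score function as interchangeable.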

Our model uses either the dot product or the negative L2 distance between the user vector and the item vector as its score function. Regardless of the score function, we update our model using a weighted hinge loss whose weight is calculated from the approximated rank of the positive item with respect to user u.


where the clamp function truncates negative values at zero, and the margin is a hyperparameter inside the clamp.

The weight is defined to be larger if the positive item is estimated to sit at a lower rank. Similar to the sampling procedure in [12], it is defined to be parallelizable to allow fast computation on GPUs: for each update, we sample items from the set of items that the user did not interact with, and the rank of the positive item is approximated from the sampled negatives. The number of negative samples is usually between ten and a few hundred.
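A possible vectorized form of this weighted hinge loss is sketched below. It assumes a WARP/CML-style rank estimate scaled up from the fraction of margin-violating sampled negatives and a logarithmic weight shape; the paper's exact weight function is not recoverable from this copy, so treat both choices as assumptions.

```python
import numpy as np

def weighted_hinge_loss(pos_score, neg_scores, n_items, margin=1.0):
    # per-negative hinge: clamp(margin - s_ui + s_uj, 0)
    viol = np.clip(margin - pos_score + neg_scores, 0.0, None)
    # parallel rank estimate: scale the violation rate up to the catalog size
    est_rank = n_items * np.mean(viol > 0)
    # log-shaped weight: larger when the positive item sits at a lower rank
    w = np.log(1.0 + est_rank)
    return w * np.mean(viol)
```

When no sampled negative violates the margin, the estimated rank is zero and the loss vanishes; the more negatives score above the positive, the harder the update pushes.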

3.2 Relaxed Precision

An n-dimensional permutation is a vector of n distinct indices from 1 to n. Every permutation can be represented by an n-by-n permutation matrix whose element at row k and column j is 1 if the item ranked k-th is item j, and 0 otherwise.

For example, multiplying a score vector by a permutation matrix reorders the scores accordingly.

We can represent sorting in decreasing order of the score vector via the permutation matrix as follows (Corollary 3 in [9]):


where 1 refers to the all-ones column vector and A_s is the matrix of absolute pairwise score differences, A_s[i, j] = |s_i - s_j|. Note that the k-th row of the permutation matrix is equal to the one-hot vector representation of the item of rank k. Thus we can represent Hit (Eq. (1)) using the dot product of that row and the relevance vector.
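Concretely, the hard sort can be written as a permutation matrix whose k-th row is the one-hot vector of the rank-k item, so Hit becomes a dot product with the relevance vector. The scores and relevance values below are illustrative:

```python
import numpy as np

def sort_permutation(s):
    # P[k, :] is the one-hot vector of the item ranked k-th by descending score
    order = np.argsort(-np.asarray(s, dtype=float))
    P = np.zeros((len(s), len(s)))
    P[np.arange(len(s)), order] = 1.0
    return P

s = np.array([0.2, 1.5, 0.7])   # predicted scores
y = np.array([0.0, 1.0, 0.0])   # item 1 is in the validation set
P = sort_permutation(s)
sorted_scores = P @ s           # scores in decreasing order
hit = P @ y                     # hit[k] = Hit at rank k+1
```

Here the top-ranked item is the relevant one, so the hit vector is (1, 0, 0); every ranking metric of Section 2.2 is then a weighted sum over this vector.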

Thus we obtain the representation of ranking metrics, or Eq. (2), in terms of vector arithmetic.


In [9], the authors propose a differentiable generalization of sorting by relaxing the permutation matrix of Eq. (4) into a row-stochastic matrix, allowing differentiation of operations that involve sorting real values. We can construct this relaxed matrix by the following equation; the i-th row of the relaxed matrix is defined as:

where τ is a temperature parameter. A higher value of τ means each row of the relaxed matrix becomes flatter. This relaxation is continuous everywhere and differentiable almost everywhere with respect to the elements of the score vector. As τ approaches 0, the relaxed matrix reduces to the permutation matrix.

We can obtain a differentiable relaxed objective, optimizable from Eq. (5) with gradient-based updates, by simply replacing the permutation matrix with its relaxation. Explicitly,


Since the softmax function is differentiable, this objective is now differentiable.
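Putting the pieces together, the relaxation and the resulting relaxed Precision@K can be sketched as below. This is a sketch under the assumption that the paper's construction matches the unimodal row-stochastic form of Grover et al. [9]; the scores and relevance vector are illustrative.

```python
import numpy as np

def relaxed_permutation(s, tau=1.0):
    # Row-stochastic relaxation of the descending-sort permutation matrix [9]:
    # row i is softmax(((n + 1 - 2i) * s_j - sum_k |s_j - s_k|) / tau) over j
    s = np.asarray(s, dtype=float)
    n = len(s)
    col_pen = np.abs(s[:, None] - s[None, :]).sum(axis=1)  # sum_k |s_j - s_k|
    coeff = (n + 1 - 2 * np.arange(1, n + 1))[:, None]     # n + 1 - 2i per row
    logits = (coeff * s[None, :] - col_pen[None, :]) / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True)) # stable softmax
    return e / e.sum(axis=1, keepdims=True)

def relaxed_precision_at_K(s, y, K, tau=1.0):
    # replace the hard permutation with its relaxation; differentiable in s
    P_hat = relaxed_permutation(s, tau)
    return float((P_hat[:K] @ y).sum() / K)
```

At a small temperature, each row concentrates on the item of the corresponding rank, so the relaxed precision approaches the hard Precision@K; at larger temperatures, the objective is smoother but blurs ranks together.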

We empirically find that it is slightly more stable to update the model using the equation below:


where .

Note that minimizing Eq. (7) is equivalent to maximizing Eq. (6).

3.3 Model Update

Having developed our objective in the section above, we incorporate the loss of Eq. (7) into the model learning structure of Eq. (3). We propose learning from both worlds via a joint learning cost:


This is the objective of our model. We can also view this objective as regularizing the pairwise ranking objective (Eq. (3)) against violations of correct rankings. The effect of the DRM term is controlled by a scaling parameter.
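Based on the surrounding description, the joint objective plausibly has the form below, where λ is the scaling parameter; this is a reconstruction, since the paper's exact symbols are not recoverable from this copy:

```latex
\mathcal{L}(u)
  \;=\; \underbrace{\mathcal{L}_{\text{hinge}}(u)}_{\text{pairwise, Eq. (3)}}
  \;+\; \lambda \, \underbrace{\mathcal{L}_{\text{DRM}}(u)}_{\text{listwise, Eq. (7)}}
```

Setting λ = 0 recovers the base pairwise model, which is why the DRM term can be read as a ranking-aware regularizer.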

Interpretation. We have the gradient update rule of the loss (7) with respect to a score vector .



for all .

Since each diagonal entry of this matrix is the negative summation of the other elements in its row, we can treat it as the adjacency matrix of a directed graph with items as nodes, containing a skew-adjacent part. If an item i has a lower score than another item j, then there is an outgoing edge from i to j whose value reflects the rank relation between them, and vice versa. From this point of view, the objective can act on a ranking decision using only the information in the score vector while item ranks are being calculated. Moreover, for each rank, we can infer that the corresponding item is sorted correctly if the matching diagonal entry is 0. We can therefore observe the propensity of the ranking decision through this matrix, viewed as a modified graph Laplacian.

As we consider factor based models with gradient updates using negative sampling [24, 12, 11, 16, 33], we follow a similar sampling procedure with additional positive item sampling. One training sample contains a user u, positive items that the user has interacted with, and negative items that the user did not interact with. We empirically set the number of negatives to be a fixed multiple of the number of positives. We construct a list of the sampled positive and negative items, and build a relevance vector of the same size whose first elements, corresponding to the positives, are all ones and which is zero elsewhere. The score vector is constructed similarly.
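The list construction described above can be sketched as follows; the positive count `m` and the negative ratio are hypothetical placeholders for the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_list(user_items, n_items, m=2, neg_ratio=10):
    # one DRM training sample: m positives followed by m * neg_ratio negatives
    pos = rng.choice(sorted(user_items), size=m, replace=False)
    negs = []
    while len(negs) < m * neg_ratio:
        j = int(rng.integers(n_items))
        if j not in user_items:
            negs.append(j)
    items = np.concatenate([pos, np.array(negs)])
    y = np.zeros(len(items))
    y[:m] = 1.0   # relevance vector: ones for the leading positives
    return items, y
```

Because the positives sit at the front of the list, the relevance vector needed by the DRM loss is simply ones followed by zeros, with no per-sample lookup.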

The learning procedure for our model is summarized at Alg. 1.

  Initialize user factors
  Initialize item factors
  repeat
     Sample a user
     Sample positive items that the user has interacted with
     Sample negative items that the user did not interact with
     Choose one positive item with the smallest score among the sampled positives
     Choose one negative item with the largest score among the sampled negatives
     Compute the joint loss and update the user and item factors using the Adagrad optimizer
  until converged
Algorithm 1: Learning algorithm for the proposed model

4 Related Works

Bayesian Personalized Ranking [24] proposed a pairwise cost function to maximize the Area Under the Curve (AUC). This framework gives a method to learn personalized rankings for factor based and nearest-neighbor based recommenders. One significant drawback of this model is that AUC does not discriminate between items at higher ranks and those at lower ranks, unlike NDCG and MAP. This property does not fit well with the top-K recommendation scenario. Our model, unlike BPR, focuses on the few items at the highest ranks. This fits the actual recommendation scenario, where only a small number of items can be recommended at a time, and thus results in better top-K recommendations.

Cofactor [19] proposed word2vec-like [18, 22] embedding techniques to embed item co-occurrence information into the matrix factorization model. This is achieved by adding an additional objective function to the existing matrix factorization objectives. SRRMF [4] claims that merely treating missing ratings as zeros leads to suboptimal behavior. It proposes smoothing negative feedback to nonzero values according to their approximated rank. These two works are the closest to ours in that they propose a new objective or view to interpret data without requiring additional input such as context. However, they are limited in that their objectives cannot be applied to general gradient based models.

Listwise Collaborative Filtering [14] attempts to tackle the misalignment between cost and objective on memory based, KNN recommenders. It proposed a method to calculate the similarity between two lists. Our work is orthogonal to it because we propose a solution for factor based, or model based, recommenders.

5 Empirical Evaluation

In this section, we evaluate our proposed learning objective upon various existing recommendation models.

5.1 Experiment Setup

SketchFab Epinion ML-20M Melon
#users 16K 20K 133K 104K
#items 28K 59K 15K 81K
#interactions 447K 500K 8M 3.3M
avg. row 28.74 23.59 58.45 31.39
avg. col 15.52 8.52 514.54 40.44
density 0.10% 0.04% 0.38% 0.04%
concentration 35.63 38.89 65.75 38.06
domain 3D model Product Movie Music
Table 1: Dataset statistics. For each dataset, #users and #items denote the number of users and items; #interactions denotes the number of transactions or clicks; avg. row and avg. col denote the average number of items that each user has interacted with, and the average number of users who have interacted with each item, respectively; density denotes the interaction matrix density (i.e., density = #interactions / (#users × #items)); concentration denotes the proportion of interactions accounted for by the top 5% most-clicked items; domain describes the domain of each dataset.

We evaluate our approach and baseline models with several datasets of real-world user-item interactions, described below. Their statistics and characteristics are summarized in Table 1.

SketchFab [25]: This dataset contains user click streams on 3D models. We view likes as a user-item interaction signal, and only consider items whose interaction count is no less than 5.

Epinion [29]: This dataset has product reviews and five-star rating information from a web commerce site. We view each rating as a user-item interaction signal.

ML-20M [10]: This dataset contains five-star ratings (with half stars) of items from various users. We interpret each rating as a user-item interaction. We exclude ratings lower than four and treat the remainder as binary implicit feedback.

Melon: This dataset contains playlists from a music streaming service. To be consistent with the implicit user feedback setting, we treat each playlist as a user, and the songs in a playlist as the items that the user has interacted with.

Evaluation Protocol

We randomly split interaction data into training, validation, and test datasets in 70%, 10%, and 20% portions, respectively. We first train models once using the training data to find the best hyperparameter settings for each model, evaluating on the validation dataset. We then train models five times with the best hyperparameter settings using both training and validation data, evaluate the models on the test data, and report the average of the evaluation metrics. We skip evaluating users having fewer than three interactions in the training dataset. We use Recall@50 for model validation. We conduct Welch's T-test [30] on the results, denoting results with a p-value below the significance threshold in boldface, and ties in italic.

We report mean AP@10 (MAP@10), NDCG@10, Recall@50, and NDCG@50, in consideration of the recent trend toward two-stage recommenders [8, 5, 21]. Recommenders can be used at the (1) candidate generation stage, where we are interested in filtering out items that are unlikely to be recommended; the filtered items are further ordered in the next stage by (2) ranking models, according to the preferences they generate. Recall is better suited to evaluating recommenders for candidate generation, because ranking does not matter at that stage, so we use a large truncation value for it. For the ranking stage, we use ranking metrics such as MAP and NDCG with a small truncation value, because only a handful of items can be recommended to users.

Our Model

We use two variants of our model: one uses the dot product as its score function, and the other uses the negative L2 distance between the user vector and the item vector.


SLIM [23]: Sparse Linear Method for top-K recommendation. It is a state-of-the-art item based collaborative filtering algorithm in which the item-item similarity matrix is represented as a sparse matrix. It generates item scores for a user by a weighted sum of the similarities of the items that the user has previously consumed.

CDAE [33, 20] is a factor based recommender that represents user factors using an encoder, a multi-layer perceptron whose input is the embeddings of the items that the user has consumed.

WMF [13] is a state-of-the-art matrix factorization model that uses a pointwise loss and minimizes it using Alternating Least Squares.

BPR [24] is a matrix factorization model that exploits a pairwise sigmoid cost function designed to optimize the area under the ROC curve.

WARP [31] is a matrix factorization model trained using a hinge loss with approximated-rank-based weights.

CML [12] is a factor based recommendation model that models user-item preference as a negative value of the distance of user vector and item vector.

SQLRank-MF [32] is a matrix factorization model with a cost function based on a permutation probability over a list of items.


is a state-of-the-art factor based recommendation, which interpolates scores of unobserved feedbacks to be nonzero, giving differnt importances on unobserved feedback.

For WMF and BPR, we used the open-source implementation Implicit [7]. For WARP, we used the open-source recommender LightFM [17]. We used the implementations made publicly available by the authors for SLIM, SQLRank-MF, and SRRMF. We implemented CDAE, CML, and our models using PyTorch 1.5.0. We ran our experiments on a machine with an Intel(R) Xeon(R) CPU E5-2698 and an NVIDIA Tesla V100 GPU with CUDA 10.1.

5.2 Alignment Between Train Cost and Evaluation Metrics

We conduct an illustrative experiment to show the alignment of the objective of Eq. (8) with the evaluation metric. Figure 1 plots normalized costs and mean AP@10 evaluated on training data over training epochs. We did not observe that decreasing costs also decrease performance on MAP@10. We conjecture this is because both models impose strong regularization, forcing the norm of the latent representations of users and items to be equal to or smaller than a fixed bound. However, as claimed, we observe that the loss under joint learning with the DRM objective is more negatively correlated with MAP@10 than that of WARP alone: the correlation between loss and MAP@10 is -0.933 for WARP and -0.990 for WARP + DRM. This means that our objective is more strongly related to the top-K recommendation task.

Figure 1: Normalized loss per sample and MAP@10 on training data versus training epochs. The correlation between loss and MAP@10 is -0.933 for WARP and -0.990 for WARP + DRM.

5.3 Quantitative Results

Dataset    Metric     SLIM    CDAE    BPR     WMF     WARP    CML     SQL-Rank  SRRMF   Ours-1           Ours-2
SketchFab  MAP@10     0.0300  0.0351  0.0216  0.0335  0.0363  0.0358  0.0101    0.0200  0.0399 (9.9%)    0.0390 (7.4%)
           NDCG@10    0.1163  0.1301  0.0905  0.1257  0.1354  0.1379  0.0417    0.0862  0.1479 (7.2%)    0.1466 (6.3%)
           Recall@50  0.2696  0.2793  0.2168  0.2862  0.2923  0.3040  0.1422    0.1550  0.3091 (1.6%)    0.3028 (-0.3%)
           NDCG@50    0.1067  0.1657  0.1218  0.1645  0.1724  0.1778  0.0537    0.0995  0.1855 (4.3%)    0.1836 (3.2%)
Epinion    MAP@10     0.0086  0.0128  0.0062  0.0123  0.0100  0.0130  0.0036    0.0107  0.0144 (10.7%)   0.0137 (5.3%)
           NDCG@10    0.0357  0.0453  0.0238  0.0486  0.0387  0.0493  0.0168    0.0428  0.0532 (7.9%)    0.0523 (6.0%)
           Recall@50  0.1081  0.1123  0.0661  0.1325  0.1158  0.1347  0.0432    0.1275  0.1361 (1.0%)    0.1308 (-2.8%)
           NDCG@50    0.0410  0.0646  0.0362  0.0726  0.0610  0.0736  0.0252    0.0680  0.0766 (4.0%)    0.0746 (1.3%)
ML-20M     MAP@10     0.1287  0.1569  0.0787  0.1034  0.1030  0.1331  N/A       0.0987  0.1475 (-5.9%)   0.1598 (1.8%)
           NDCG@10    0.2761  0.3205  0.1917  0.2561  0.2300  0.2824  N/A       0.2532  0.3068 (-4.2%)   0.3267 (1.9%)
           Recall@50  0.4874  0.4829  0.3431  0.4676  0.4187  0.4874  N/A       0.4786  0.4944 (1.4%)    0.5014 (2.8%)
           NDCG@50    0.2511  0.3667  0.2394  0.3288  0.2887  0.3416  N/A       0.3244  0.3627 (-1.0%)   0.3912 (6.6%)
Melon      MAP@10     0.0838  0.0612  0.0400  0.0562  0.0474  0.0692  N/A       0.0652  0.0764 (-8.8%)   0.0892 (6.4%)
           NDCG@10    0.1768  0.1041  0.0972  0.1303  0.1217  0.1659  N/A       0.1324  0.1802 (1.9%)    0.2010 (13.6%)
           Recall@50  0.3415  0.1928  0.2159  0.2537  0.2577  0.3361  N/A       0.2863  0.3471 (1.6%)    0.3700 (8.3%)
           NDCG@50    0.2206  0.1335  0.1363  0.1716  0.1654  0.2206  N/A       0.1842  0.2334 (5.8%)    0.2550 (15.5%)

Table 2: Recommendation performances of different methods. Ours-1 and Ours-2 denote the two variants of our model. Results in boldface denote that our model outperforms with statistical significance under a paired T-test; results in italic denote that our model outperforms, but not statistically significantly. The increment in parentheses refers to the relative improvement over the performance of the best baseline.

We cannot train SQLRank-MF on the large datasets, ML-20M and Melon, because of its huge training time: it took about two days to train on Epinion, the smallest dataset we use, on our machine, and it is infeasible to run on ML-20M and Melon. Therefore, we only record the performance of SQLRank-MF on the SketchFab and Epinion datasets.

Table 2 shows the performance of the various models on the four datasets in terms of several ranking metrics. We observe that the proposed methods outperform state-of-the-art models by a large margin on all datasets we use. Although the matrix factorization models (BPR, WMF, and WARP) share the same model formulation, the differences in performance among them are large. For example, while WMF achieves the smallest training error using a pointwise loss, its prediction quality falls below pairwise models such as WARP, CML, and our models on many datasets. Note that although WARP behaves poorly on the SketchFab dataset, its DRM-augmented counterpart achieves the best prediction quality among the models we evaluate; the two differ only in the additional loss term. These trends hold across the other datasets as well. We credit this performance gain to the proposed objective, which makes factor based models aware of the top-K recommendation nature.

5.4 Effects of hyperparameters

(a) SketchFab
(b) Melon
Figure 2: Effect of the number of positive samples in the DRM cost on our models. The weight of the DRM cost is held fixed.

Our objective uses sampled items to sort and rank: we sample only a handful of positive and negative items to build the list used to generate rankings during training. In Figure 2, we examine the effect of the number of positive items in this sampling. We fix the number of negative items to be a constant multiple of the number of positives.

5.5 Exploratory Analysis

(a) SketchFab
(b) ML-20M
Figure 3: NDCG@10 among different user groups by the number of interactions. The numbers in the parenthesis denote the number of users in each group.
            CML     DRM-only  Joint
MAP@10      0.0358  0.0310    0.0390
NDCG@10     0.1379  0.1234    0.1466
Recall@50   0.3040  0.2802    0.3028
NDCG@50     0.1778  0.1608    0.1836
Table 3: Comparison on the Melon dataset of joint learning against the pairwise loss only (CML) and the DRM cost only.

Figure 3 shows NDCG@10 for user groups, grouped by the number of interactions in the training datasets. Our loss function consistently improves recommendation performance for all user groups, especially when users have few interactions. For CML in particular, our additional loss brings a significant improvement.

In Table 3, we compare the jointly learned model against two alternatives: one trained with the pairwise (CML) loss only and one trained with the DRM cost only. We find that the model trained only with the DRM loss performs worse than the other two. We conjecture two possible reasons: (1) the DRM cost learns from only a small subset of total consumptions because of its training complexity, and (2) the pairwise approach has some advantage over listwise losses. In either case, joint learning enables learning from both worlds, yielding the best performance.

6 Conclusion

Although recommender systems show promising results, their performance in terms of personalized ranking may be suboptimal because they are not directly optimized for ranking. In this work, we have proposed DRM, a differentiable ranking metric that enables end-to-end training with sorting included for factor based recommenders. DRM utilizes a relaxation of the permutation matrix within a listwise approach, leading to a listwise cost function that directly maximizes metrics such as Precision. Through experiments, we demonstrate that DRM yields better recommendation quality in comparison with several state-of-the-art methods on four real-world datasets. In addition, we prove that our cost function monotonically decreases the errors in the ranking, resulting in a ranking-correction effect. In the future, we plan to apply our cost function to various recommendation models, such as autoencoders or other deep learning inspired models. It would also be interesting to investigate various ranking metrics, whereas we only explored the relaxation of Precision.


This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) and funded by Kakao I Research Supporting Program.


  • [1] D. C. Blair and M. E. Maron (1985) An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM 28 (3), pp. 289–299. Cited by: §2.2.
  • [2] C. J. C. Burges, R. Ragno, and Q. V. Le (2006) Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pp. 193–200. Cited by: §1.
  • [3] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136. Cited by: §2.1.
  • [4] J. Chen, D. Lian, and K. Zheng (2019) Improving one-class collaborative filtering via ranking-based implicit regularizer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 37–44. Cited by: §4, §5.1.
  • [5] P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016, pp. 191–198. Cited by: §5.1.
  • [6] P. Cremonesi, Y. Koren, and R. Turrin (2010) Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems, pp. 39–46. Cited by: §2.1.
  • [7] B. Fredrickson (2017) Fast python collaborative filtering for implicit datasets. GitHub. Cited by: §5.1.
  • [8] C. A. Gomez-Uribe and N. Hunt (2016) The netflix recommender system: algorithms, business value, and innovation. ACM Trans. Manag. Inf. Syst. 6 (4), pp. 13:1–13:19. Cited by: §5.1.
  • [9] A. Grover, E. Wang, A. Zweig, and S. Ermon (2019) Stochastic optimization of sorting networks via continuous relaxations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §1, §3.2, §3.2.
  • [10] F. M. Harper and J. A. Konstan (2016) The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4), pp. 19:1–19:19. Cited by: §5.1.
  • [11] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pp. 173–182. Cited by: §3.3.
  • [12] C. Hsieh, L. Yang, Y. Cui, T. Lin, S. J. Belongie, and D. Estrin (2017) Collaborative metric learning. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pp. 193–201. Cited by: §1, §3.1, §3.1, §3.3, §5.1.
  • [13] Y. Hu, Y. Koren, and C. Volinsky (2008) Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy, pp. 263–272. Cited by: §3.1, §5.1.
  • [14] S. Huang, S. Wang, T. Liu, J. Ma, Z. Chen, and J. Veijalainen (2015) Listwise collaborative filtering. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, New York, NY, USA, pp. 343–352. Cited by: §1, §2.1, §4.
  • [15] K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20 (4), pp. 422–446. Cited by: §2.2.
  • [16] C. C. Johnson (2014) Logistic matrix factorization for implicit feedback data. In Advances in Neural Information Processing Systems 27, Cited by: §3.3.
  • [17] M. Kula (2015) Metadata embeddings for user and item cold-start recommendations. In Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems co-located with 9th ACM Conference on Recommender Systems (RecSys 2015), Vienna, Austria, September 16-20, 2015, T. Bogers and M. Koolen (Eds.), CEUR Workshop Proceedings, Vol. 1448, pp. 14–21. Cited by: §5.1.
  • [18] O. Levy and Y. Goldberg (2014) Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2177–2185. Cited by: §4.
  • [19] D. Liang, J. Altosaar, L. Charlin, and D. M. Blei (2016) Factorization meets the item embedding: regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016, pp. 59–66. Cited by: §4.
  • [20] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara (2018) Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference, WWW ’18. Cited by: §5.1.
  • [21] J. Ma, Z. Zhao, X. Yi, J. Yang, M. Chen, J. Tang, L. Hong, and E. H. Chi (2020) Off-policy learning in two-stage recommender systems. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pp. 463–473. Cited by: §5.1.
  • [22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 3111–3119. External Links: Link Cited by: §4.
  • [23] X. Ning and G. Karypis (2011) SLIM: sparse linear methods for top-n recommender systems. In 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011, pp. 497–506. Cited by: §5.1.
  • [24] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. In UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, pp. 452–461. Cited by: §1, §3.1, §3.3, §4, §5.1.
  • [25] E. Rosenthal (2016) Likes out! guerilla dataset!. GitHub. Cited by: §5.1.
  • [26] R. Salakhutdinov and A. Mnih (2008) Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, Vol. 20. Cited by: §3.1.
  • [27] B. Schwartz (2018) The paradox of choice. Wiley. Cited by: §1.
  • [28] Y. Shi, M. A. Larson, and A. Hanjalic (2010) List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, September 26-30, 2010, pp. 269–272. Cited by: §2.1.
  • [29] J. Tang, H. Gao, H. Liu, and A. Das Sarma (2012) ETrust: understanding trust evolution in an online world. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 253–261. Cited by: §5.1.
  • [30] B. L. Welch (1947) The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34 (1/2), pp. 28–35. Cited by: §5.1.
  • [31] J. Weston, S. Bengio, and N. Usunier (2011) WSABIE: scaling up to large vocabulary image annotation. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, T. Walsh (Ed.), pp. 2764–2770. Cited by: §1, §5.1.
  • [32] L. Wu, C. Hsieh, and J. Sharpnack (2018) SQL-rank: a listwise approach to collaborative ranking. In Proceedings of Machine Learning Research (35th International Conference on Machine Learning), Vol. 80. Cited by: §1, §2.1, §5.1.
  • [33] Y. Wu, C. DuBois, A. X. Zheng, and M. Ester (2016) Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 153–162. Cited by: §3.3, §5.1.
  • [34] J. Xu, T. Liu, M. Lu, H. Li, and W. Ma (2008) Directly optimizing evaluation measures in learning to rank. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 107–114. Cited by: §1.