1. Introduction
Internet users can access an ever-growing number of online products and services, which makes it increasingly difficult for them to identify the items that will appeal to them among a plethora of candidates. To reduce information overload and to satisfy the diverse needs of users, personalized recommender systems have emerged and now play an important role in modern society. These systems provide personalized experiences, serve huge service demands, and benefit both the user side and the supply side. They can: (i) help users easily discover products that are likely to interest them; and (ii) create opportunities for product and service providers to better serve customers and to increase revenue.
In all kinds of recommender systems, modeling the user-item interaction lies at the core. Recent recommendation models commonly infer the user preference in one of two ways: matrix factorization (MF) and multilayer perceptrons (MLPs). MF-based methods (e.g., (DBLP:conf/icdm/HuKV08; DBLP:conf/uai/RendleFGS09)) apply the inner product between latent factors of users and items to predict the user preferences for different items. The latent factors strive to depict the user-item relationships in the latent space. In contrast, MLP-based methods (e.g., (DBLP:conf/www/HeLZNHC17; DBLP:conf/recsys/CovingtonAS16)) adopt (deep) neural networks to learn nonlinear user-item relationships, which can generate better latent feature combinations between the embeddings of users and items (DBLP:conf/www/HeLZNHC17). However, both MF-based and MLP-based methods violate the triangle inequality (DBLP:conf/nips/Shrivastava014), and as a result may fail to capture fine-grained preference information (DBLP:conf/www/HsiehYCLBE17). As a concrete example from (DBLP:conf/icdm/ParkKXY18), if a user accessed two items, MF-based or MLP-based methods will put both items close to the user, but will not necessarily put these two items close to each other, even if they share similar properties.
To address the limitations of MF-based and MLP-based methods, metric (distance) learning approaches have been utilized in recommendation models (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/icdm/ParkKXY18; DBLP:conf/www/TayTH18; DBLP:journals/corr/LiZZQZHH20), as a distance naturally satisfies the triangle inequality. These techniques project users and items into a low-dimensional metric space, where the user preference is measured by the distance to items. Specifically, CML (DBLP:conf/www/HsiehYCLBE17) and LRML (DBLP:conf/www/TayTH18) are two representative models. CML minimizes the Euclidean distance between users and their accessed items, which facilitates user-user/item-item similarity learning. LRML incorporates a memory network to introduce additional capacity to learn relations between users and items in the metric space.
Although existing distance-based methods have achieved satisfactory results, we argue that there are still several avenues for enhancing performance. First, previous distance-based methods (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/icdm/ParkKXY18; DBLP:conf/www/TayTH18; DBLP:journals/corr/LiZZQZHH20) learn the user and item embeddings in a deterministic manner without handling the uncertainty. Relying solely on the learned deterministic embeddings may lead to an inaccurate understanding of user preferences. A motivating example is shown in Figure 1. After having accessed two songs with different genres, the user may be placed between these two songs in the latent space. If we only consider deterministic embeddings, a third song should be a good candidate. But if we consider the embeddings from a probabilistic perspective, a fourth song can be a better recommendation, and it has the same genre as one of the accessed songs. Second, most of the existing methods (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18; DBLP:conf/icdm/ParkKXY18) adopt the margin ranking loss (hinge loss) with a fixed margin as a hyperparameter. We argue that the margin value should be adaptive and relevant to the corresponding training samples. Furthermore, different training phases may need different magnitudes of margin values, so a fixed value may not be optimal. Third, previous distance-based methods (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18; DBLP:conf/icdm/ParkKXY18; DBLP:journals/corr/LiZZQZHH20) do not explicitly model user-user and item-item relationships. Closely related users are very likely to share the same interests, and if two items have similar attributes it is likely that a user will favour both. When inferring a user's preferences, we should explicitly take into account the user-user and item-item similarities.
To address these shortcomings, we propose a Probabilistic Metric Learning model with an Adaptive Margin (PMLAM) for Top-K recommendation. PMLAM consists of three major components: 1) a user-item interaction module, 2) an adaptive margin generation module, and 3) a user-user/item-item relation modeling module. To capture the uncertainties in the learned user and item embeddings, each user or item is parameterized with one Gaussian distribution, where the distribution-related parameters are learned by our model. In the user-item interaction module, we adopt the Wasserstein distance to measure the distances between users and items, thus taking into account not only the means but also the uncertainties. In the adaptive margin generation module, we model the learning of adaptive margins as a bilevel (inner and outer) optimization problem (DBLP:journals/tec/SinhaMD18), where we build a proxy function to explicitly link the learning of margin-related parameters with the outer objective function. In the user-user and item-item relation modeling module, we incorporate two margin ranking losses with adaptive margins for user pairs and item pairs, respectively, to explicitly encourage similar users or items to be mapped closer to one another in the latent space. We extensively evaluate our model by comparing with many state-of-the-art methods, using two performance metrics on five real-world datasets. The experimental results not only demonstrate the improvements of our model over other baselines but also show the effectiveness of the proposed modules.
To summarize, the major contributions of this paper are:

To capture the uncertainties in the learned user/item embeddings, we represent each user and item as a Gaussian distribution. The Wasserstein distance is leveraged to measure the user preference for items while simultaneously considering the uncertainty.

To generate an adaptive margin, we cast margin generation as a bilevel optimization problem, where a proxy function is built to explicitly update the parameters related to margin generation.

To explicitly model the user-user and item-item relationships, we apply two margin ranking losses with adaptive margins to force similar users and items to map closer to one another in the latent space.

Experiments on five real-world datasets show that the proposed PMLAM model significantly outperforms the state-of-the-art methods for the Top-K recommendation task.
2. Related Work
In this section, we summarize and discuss work that is related to our proposed Top-K recommendation model.
In many real-world recommendation scenarios, user implicit data (DBLP:conf/kdd/WangWY15; DBLP:conf/kdd/MaKL19), e.g., clicking history, is more common than explicit feedback (DBLP:conf/icml/SalakhutdinovMH07) such as user ratings. The implicit feedback setting, also called one-class collaborative filtering (OCCF) (DBLP:conf/icdm/PanZCLLSY08), arises when only positive samples are available. Several families of methods have been proposed to tackle this challenging problem.
Matrix Factorization-based Methods. Popularized by the Netflix prize competition, matrix factorization (MF) based methods have become a prominent solution for personalized recommendation (DBLP:journals/computer/KorenBV09). In (DBLP:conf/icdm/HuKV08), Hu et al. propose a weighted regularized matrix factorization (WRMF) model to treat all the missing data as negative samples, while heuristically assigning confidence weights to positive samples. Rendle et al. adopt a different approach in (DBLP:conf/uai/RendleFGS09), proposing a pairwise ranking objective (Bayesian personalized ranking) to model the pairwise relationships between positive items and negative items for each user, where the negative samples are randomly sampled from the unobserved feedback. To allow unobserved items to have varying degrees of importance, He et al. in (DBLP:conf/sigir/HeZKC16) propose to weight the missing data based on item popularity, demonstrating improved performance compared to WRMF.
Multilayer Perceptron-based Methods. Due to their ability to learn more complex nonlinear relationships between users and items, (deep) neural networks have been a great success in the domain of recommender systems. He et al. in (DBLP:conf/www/HeLZNHC17) propose a neural network-based collaborative filtering model, where a multilayer perceptron is used to learn the nonlinear user-item interactions. In (DBLP:conf/wsdm/WuDZE16; DBLP:conf/cikm/MaZWL18; DBLP:conf/wsdm/MaKWWL19), (denoising) autoencoders are employed to learn the user or item hidden representations from user implicit feedback. Autoencoder approaches can be shown to be generalizations of many of the MF methods (DBLP:conf/wsdm/WuDZE16). In (DBLP:conf/ijcai/XueDZHC17; DBLP:conf/ijcai/GuoTYLH17), conventional matrix factorization and factorization machine methods benefit from the representation ability of deep neural networks for learning either the user-item relationships or the interactions with side information. Graph neural networks (GNNs) have recently been incorporated in recommendation algorithms because they can learn and model relationships between entities (DBLP:conf/sigir/Wang0WFC19; DBLP:conf/icdm/SunZMCGTH19; DBLP:conf/aaai/MaMZSLC20).
Distance-based Methods. Due to their capacity to measure the distance between users and items, distance-based methods have been successfully applied in Top-K recommendation. In (DBLP:conf/www/HsiehYCLBE17), Hsieh et al. propose to compute the Euclidean distance between users and items for capturing fine-grained user preference. In (DBLP:conf/www/TayTH18), Tay et al. adopt a memory network (DBLP:conf/nips/SukhbaatarSWF15) to explicitly store the user preference in external memories. Park et al. in (DBLP:conf/icdm/ParkKXY18) apply a translation embedding to capture more complex relations between users and items, where the translation embedding is learned from the neighborhood information of users and items. In (DBLP:conf/recsys/HeKM17), He et al. apply a distance metric to capture how the user interest shifts across sequential user-item interactions. In (DBLP:journals/corr/LiZZQZHH20), Li et al. propose to measure the trilateral relationship from both the user-centric and item-centric perspectives and learn adaptive margins for the central user and positive item.
Our proposed recommendation model is different in key ways from all of the methods identified above. In contrast to the matrix factorization (DBLP:conf/icdm/HuKV08; DBLP:conf/uai/RendleFGS09; DBLP:conf/sigir/HeZKC16) and neural network methods (DBLP:conf/www/HeLZNHC17; DBLP:conf/wsdm/WuDZE16; DBLP:conf/cikm/MaZWL18; DBLP:conf/wsdm/MaKWWL19; DBLP:conf/ijcai/XueDZHC17; DBLP:conf/ijcai/GuoTYLH17; DBLP:conf/sigir/Wang0WFC19), we employ the Wasserstein distance, which obeys the triangle inequality. This is important for ensuring that users with similar interaction histories are mapped close together in the latent space. In contrast to most of the prior distance-based approaches (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18; DBLP:conf/icdm/ParkKXY18; DBLP:journals/corr/LiZZQZHH20), we employ parameterized Gaussian distributions to represent each user and item in order to capture the uncertainties of learned user preferences and item properties. Moreover, we formulate a bilevel optimization problem and incorporate a neural network to generate adaptive margins for the commonly applied margin ranking loss function.
3. Problem Formulation
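The recommendation task considered in this paper takes as input the user implicit feedback. For each user $i$, the user preference data is represented by a set $\mathcal{D}_i$ that includes the items she preferred, e.g., $\mathcal{D}_i = \{I_1, \ldots, I_{|\mathcal{D}_i|}\}$, where each $I_j$ is an item index in the dataset. The top-$K$ recommendation task in this paper is formulated as: given the training item set $\mathcal{S}_i$ and the non-empty test item set $\mathcal{T}_i$ (requiring that $\mathcal{S}_i \cup \mathcal{T}_i = \mathcal{D}_i$ and $\mathcal{S}_i \cap \mathcal{T}_i = \emptyset$) of user $i$, the model must recommend an ordered set of items $\mathcal{X}_i$ such that $|\mathcal{X}_i| \le K$ and $\mathcal{X}_i \cap \mathcal{S}_i = \emptyset$. The recommendation quality is then evaluated by a matching score between $\mathcal{T}_i$ and $\mathcal{X}_i$, such as Recall@$K$.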
4. Methodology
In this section, we present the proposed model shown in Fig. 2. We first introduce the user-item interaction module, which captures the user-item interactions by calculating the Wasserstein distance between the distributions of users and items. Then we describe the adaptive margin generation module, which generates adaptive margins during the training process. Next, we present the user-user and item-item relation modeling module. Lastly, we specify the objective function and explain the training process of the proposed model.
4.1. Wasserstein Distance for Interactions
Previous works (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18) use the user and item embeddings in a deterministic manner and do not measure or learn the uncertainties of user preferences and item properties. Motivated by probabilistic matrix factorization (PMF) (DBLP:conf/nips/SalakhutdinovM07), we represent each user or item as a single Gaussian distribution. In contrast to PMF, which applies Gaussian priors on user and item embeddings, users and items in our model are parameterized by Gaussian distributions, where the means and covariances are directly learned. Specifically, the latent factors of user $i$ and item $j$ are represented as:
(1)  $\mathbf{h}_i \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad \mathbf{h}_j \sim \mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$
Here $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are the learned mean vector and covariance matrix of user $i$, respectively; and $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ are the learned mean vector and covariance matrix of item $j$. To limit the complexity of the model and reduce the computational overhead, we assume that the embedding dimensions are uncorrelated. Thus, $\boldsymbol{\Sigma}$ is a diagonal covariance matrix that can be represented as a vector. Specifically, $\boldsymbol{\mu} \in \mathbb{R}^{D}$ and $\boldsymbol{\Sigma} \in \mathbb{R}^{D}$, where $D$ is the dimension of the latent space.
Widely used distance metrics for deterministic embeddings, like the Euclidean distance, do not properly measure the distance between distributions. Since users and items are represented by probability distributions, we need a distance measure between distributions. Among the commonly used distance metrics between distributions, we adopt the Wasserstein distance to measure the user preference for an item. The reasons are twofold: i) the Wasserstein distance satisfies all the properties a distance should have; and ii) the Wasserstein distance has a simple form when calculating the distance between Gaussian distributions (DBLP:conf/nips/MallastoF17). Formally, the $p$-th Wasserstein distance between two probability measures $P$ and $Q$ on a Polish metric space $(\mathcal{X}, d)$ (srivastava2008course) is defined as (Givens_1984):
$W_p(P, Q) = \Big( \inf_{J \in \mathcal{J}(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, \mathrm{d}J(x, y) \Big)^{1/p}$
where $d(x, y)^p$ is an arbitrary distance with $p$-th moment (casella2002statistical) for a deterministic variable; and $\mathcal{J}(P, Q)$ denotes the set of all measures on $\mathcal{X} \times \mathcal{X}$ which admit $P$ and $Q$ as marginals. When $p \ge 1$, the $p$-th Wasserstein distance preserves all properties of a metric, including both symmetry and the triangle inequality.
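The calculation of the general Wasserstein distance is computation-intensive (DBLP:conf/uai/XieWWZ19). To reduce the computational cost, we use Gaussian distributions for the latent representations of users and items. Then when $p = 2$, the 2nd Wasserstein distance (abbreviated as $W_2$) has a closed-form solution, which makes the calculation much faster. Specifically, we have the following formula to calculate the distance between user $i$ and item $j$ (Givens_1984):
(2)  $W_2(\mathbf{h}_i, \mathbf{h}_j)^2 = \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_2^2 + \mathrm{trace}\big(\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j - 2(\boldsymbol{\Sigma}_i^{1/2} \boldsymbol{\Sigma}_j \boldsymbol{\Sigma}_i^{1/2})^{1/2}\big)$
In our setting, we focus on diagonal covariance matrices, thus $\boldsymbol{\Sigma}_i \boldsymbol{\Sigma}_j = \boldsymbol{\Sigma}_j \boldsymbol{\Sigma}_i$. For simplicity, we use $d(i, j)$ to denote the left-hand side of Eq. 2. Then Eq. 2 can be simplified as:
(3)  $d(i, j) = \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_2^2 + \|\boldsymbol{\Sigma}_i^{1/2} - \boldsymbol{\Sigma}_j^{1/2}\|_2^2$
where the square root is taken element-wise on the diagonal entries. According to the above equation, the time complexity of calculating the distance between the latent representations of a user and an item is linear in the embedding dimension.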
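As an illustration of the linear-time computation, the following is a minimal sketch of the squared 2nd Wasserstein distance between two Gaussians with diagonal covariances. The function name `w2_squared_diag` and the representation of each covariance as a plain list of variances are assumptions made for this example and are not taken from our implementation.

```python
import math


def w2_squared_diag(mu_i, var_i, mu_j, var_j):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    mu_*  : list of means, one entry per latent dimension
    var_* : list of variances (the diagonal of each covariance matrix)
    """
    if not (len(mu_i) == len(var_i) == len(mu_j) == len(var_j)):
        raise ValueError("all inputs must have the same length")
    # Squared Euclidean distance between the mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    # For diagonal (hence commuting) covariances, the trace term reduces to
    # the squared difference of the element-wise standard deviations.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_i, var_j))
    return mean_term + cov_term
```

Both sums make a single pass over the latent dimensions, so the cost grows linearly with the embedding size. When the two variance vectors are equal, the value reduces to the squared Euclidean distance between the means.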
4.2. Adaptive Margin in Margin Ranking Loss
To learn the distance-based model, most of the existing works (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18) apply the margin ranking loss to measure the user preference difference between positive items and negative items. Specifically, the margin ranking loss ensures that the distance between a user and a positive item is less than the distance between the user and a negative item by a fixed margin $m > 0$. The loss function is:
(4)  $\mathcal{L}_{Fix}(i, j, k; \Theta) = [\, d(i, j) - d(i, k) + m \,]_+$
where $j$ is an item that user $i$ has accessed, $k$ is a randomly sampled item treated as the negative example, $\Theta$ denotes the model parameters, and $[z]_+ = \max(z, 0)$. Thus, $(i, j, k)$ represents a training triplet.
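To make the behaviour of the fixed-margin loss concrete, here is a minimal sketch of the hinge computation for a single triplet. The function name and argument names are hypothetical; the distances are assumed to be precomputed user-to-item distances.

```python
def margin_ranking_loss(d_pos, d_neg, margin):
    """Hinge loss for one (user, positive item, negative item) triplet.

    d_pos  : distance between the user and the accessed (positive) item
    d_neg  : distance between the user and the sampled (negative) item
    margin : required gap between the two distances
    """
    # The loss is zero once the negative item is at least `margin`
    # farther from the user than the positive item.
    return max(d_pos - d_neg + margin, 0.0)
```

A triplet whose negative item is already far enough away contributes no loss and no gradient, which is why the choice of margin determines which triplets drive the training.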
The safe margin in the margin ranking loss is a crucial hyperparameter that has a major impact on the model performance. A fixed margin value may not achieve satisfactory performance. First, a fixed value cannot adapt to the distinct properties of individual training triplets. For example, some users have broad interests, so the margins for these users should not be so large as to push potentially preferred items too far from the user. Other users have very focused interests, and it is desirable to have a larger margin to avoid recommending items that are not directly within the focus. Second, in different training phases, the model may need different magnitudes of margins. For instance, in the early stage of training, the model is not reliable enough to make strong predictions on user preferences, and thus imposing a large margin risks pushing potentially positive items too far from a user. Third, to achieve satisfactory performance, the selection of a fixed margin involves tedious hyperparameter tuning. Based on these considerations, we conclude that setting a fixed margin value for all training triplets may limit the model expressiveness.
To address the problems outlined above, we propose an adaptive margin generation scheme which generates margins according to the training triplets. Formally, we formulate the margin ranking loss with an adaptive margin as:
(5)  $\mathcal{L}_{Ada}(i, j, k; \Theta, \Phi) = [\, d(i, j) - d(i, k) + f(i, j, k; \Phi) \,]_+$
Here $f(i, j, k; \Phi)$ is a function that generates the specific margin based on the corresponding user and item embeddings, $\Phi$ is the learnable set of parameters associated with $f$, and $\Theta$ denotes the parameters of the recommendation model. Then we could consider optimizing $\Theta$ and $\Phi$ simultaneously:
(6)  $\min_{\Theta, \Phi} \; \mathcal{L}_{Ada}(i, j, k; \Theta, \Phi)$
Unfortunately, directly minimizing the objective function as in Eq. 6 does not achieve the desired purpose of generating suitable adaptive margins. Since the margin-related term explicitly appears in the loss function, constantly decreasing the value of the generated margin is the straightforward way to reduce the loss. As a result, all generated margins have very small values or are set to zero, leading to unsatisfactory results. In other words, the direct optimization of $\mathcal{L}_{Ada}$ with respect to $\Phi$ harms the optimization of $\Theta$.
4.2.1. Bilevel Optimization
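We model the learning of recommendation models and the generation of adaptive margins as a bilevel optimization problem (DBLP:journals/anor/ColsonMS07):
(7)  $\min_{\Phi} \; J_{outer}(\Theta^*(\Phi)) := \mathcal{L}_{Fix}(i, j, k; \Theta^*(\Phi))$
     $\text{s.t.} \;\; \Theta^*(\Phi) = \arg\min_{\Theta} \; J_{inner}(\Theta, \Phi) := \mathcal{L}_{Ada}(i, j, k; \Theta, \Phi)$
Here $\Theta$ contains the model parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. The inner objective function $J_{inner}$ attempts to minimize the adaptive-margin loss $\mathcal{L}_{Ada}$ (Eq. 5) with respect to $\Theta$, while the outer objective function $J_{outer}$ optimizes the fixed-margin loss $\mathcal{L}_{Fix}$ (Eq. 4) with respect to $\Phi$ through $\Theta^*(\Phi)$. For simplicity, the margin $m$ of $\mathcal{L}_{Fix}$ in $J_{outer}$ is fixed to a constant for guiding the learning of $f$. Thus, we can have an alternating optimization to learn $\Theta$ and $\Phi$:

$\Theta$ update phase (Inner Optimization): Fix $\Phi$ and optimize $\Theta$.

$\Phi$ update phase (Outer Optimization): Fix $\Theta$ and optimize $\Phi$.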
4.2.2. Approximate Gradient Optimization
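As most existing models utilize gradient-based methods for optimization, a simple approximation strategy with less computation is introduced as follows:
(8)  $\Theta^*(\Phi) \approx \Theta - \alpha \nabla_{\Theta} J_{inner}(\Theta, \Phi)$
In this expression, $J_{inner}$ is the inner objective of Eq. 7, $\Theta$ denotes the current parameters including $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, and $\alpha$ is the learning rate for one step of inner optimization. Related approximations have been validated in (DBLP:conf/wsdm/Rendle12; DBLP:conf/iclr/LiuSY19). Thus, we can define a proxy function to link $\Phi$ with the outer optimization:
(9)  $\tilde{J}(\Phi) := J_{outer}\big(\Theta - \alpha \nabla_{\Theta} J_{inner}(\Theta, \Phi)\big)$
For simplicity, we use two separate optimizers to update $\Theta$ and $\Phi$, respectively. The iterative procedure is shown in Alg. 1.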
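The alternating procedure can be illustrated with a deliberately small toy problem. The sketch below is not Alg. 1 itself: it assumes scalar embeddings for one user, one positive item and one negative item, uses a squared difference in place of the Wasserstein distance, uses a single softplus unit as a stand-in margin function, replaces automatic differentiation with central finite differences, and fixes the outer margin to a hypothetical value of 1.0. All function names are chosen for this example only.

```python
import math


def softplus(z):
    return math.log1p(math.exp(z))


def sq_dist(a, b):
    return (a - b) ** 2


def margin(theta, phi):
    # Stand-in margin function: depends on the embeddings through the
    # squared differences, so its gradient w.r.t. theta depends on phi.
    u, p, n = theta
    w1, w2, b = phi
    return softplus(w1 * sq_dist(u, p) + w2 * sq_dist(u, n) + b)


def inner_loss(theta, phi):
    # Hinge loss with the adaptive margin (inner objective).
    u, p, n = theta
    return max(sq_dist(u, p) - sq_dist(u, n) + margin(theta, phi), 0.0)


def outer_loss(theta, fixed_margin=1.0):
    # Hinge loss with a fixed margin (outer objective); 1.0 is hypothetical.
    u, p, n = theta
    return max(sq_dist(u, p) - sq_dist(u, n) + fixed_margin, 0.0)


def num_grad(fn, x, eps=1e-5):
    # Central finite differences, standing in for backpropagation.
    grad = []
    for idx in range(len(x)):
        hi, lo = list(x), list(x)
        hi[idx] += eps
        lo[idx] -= eps
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad


def inner_step(theta, phi, alpha):
    # One gradient step on the inner objective: the approximation of theta*.
    g = num_grad(lambda t: inner_loss(t, phi), theta)
    return [t - alpha * gi for t, gi in zip(theta, g)]


def proxy(phi, theta, alpha):
    # Outer objective evaluated at the one-step-updated model parameters.
    return outer_loss(inner_step(theta, phi, alpha))


def outer_step(theta, phi, alpha, beta):
    # One gradient step on the proxy with respect to the margin parameters.
    g = num_grad(lambda q: proxy(q, theta, alpha), phi, eps=1e-4)
    return [q - beta * gi for q, gi in zip(phi, g)]


def train(theta, phi, alpha=0.01, beta=0.01, steps=5):
    for _ in range(steps):
        theta = inner_step(theta, phi, alpha)       # model update phase
        phi = outer_step(theta, phi, alpha, beta)   # margin update phase
    return theta, phi
```

In this toy setting the margin parameters receive their gradient only through the proxy, so they are never rewarded for simply shrinking the margin, which is the failure mode of the joint minimization in Eq. 6.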
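4.2.3. The Design of the Margin Function $f$
We parameterize $f$ with a neural network to generate the margin based on the training triplet $(i, j, k)$:
(10)  
Here the weights and biases of the network are the learnable parameters in $\Phi$, $\mathbf{s}_{ijk}$ is the input to generate the margin, and $m_{ijk}$ is the generated margin of $(i, j, k)$. The softplus activation function guarantees $m_{ijk} > 0$. To promote a discriminative $\mathbf{s}_{ijk}$ that reflects the relation between the user and the two items of the triplet, the following form can be a fine-grained indicator:
(11)  $\mathbf{s}_{ijk} = (\tilde{\mathbf{h}}_i - \tilde{\mathbf{h}}_j)^2 \oplus (\tilde{\mathbf{h}}_i - \tilde{\mathbf{h}}_k)^2$
Here the element-wise square $(\cdot)^2$ is introduced to mimic the calculation of the Euclidean distance without summing over all dimensions, $-$ denotes element-wise subtraction and $\oplus$ denotes the concatenation operation. To improve the robustness of $f$, we take as inputs the sampled embeddings $\tilde{\mathbf{h}}$ of the user and items. To perform backpropagation through $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, we adopt the reparameterization trick (DBLP:journals/corr/KingmaW13) for Eq. 1:
(12)  $\tilde{\mathbf{h}} = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2} \odot \boldsymbol{\epsilon}$
where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\odot$ is element-wise multiplication.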
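The sampling step and the construction of the margin input can be sketched as follows. This is an illustration under stated assumptions: the covariance is stored as a list of variances, the indicator is taken to be the concatenation of the element-wise squared differences between the sampled user embedding and each of the two sampled item embeddings, and the names `sample_embedding` and `margin_input` are hypothetical.

```python
import math
import random


def sample_embedding(mu, var, rng):
    """Reparameterized sample: mean plus standard deviation times noise."""
    # The noise is drawn independently of mu and var, so gradients can
    # flow to both when an autodiff framework is used.
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


def margin_input(h_user, h_pos, h_neg):
    """Concatenate the element-wise squared differences of a triplet."""
    pos_part = [(a - b) ** 2 for a, b in zip(h_user, h_pos)]
    neg_part = [(a - b) ** 2 for a, b in zip(h_user, h_neg)]
    return pos_part + neg_part  # length is twice the latent dimension
```

Keeping the squared differences per dimension, instead of summing them into one distance, lets the margin network see which latent dimensions separate the positive and negative items.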
4.3. User-User and Item-Item Relations
It is important to model the relationships between pairs of users or pairs of items when developing recommender systems, and strategies for doing so effectively have been studied for many years (DBLP:conf/www/SarwarKKR01; DBLP:conf/kdd/KabburNK13; DBLP:reference/sp/NingDK15). For example, item-based collaborative filtering methods use item rating vectors to calculate the similarities between the items. Closely related users or items may share the same interests or have similar attributes. For a certain user, items similar to the user's preferred items are potential recommendation candidates.
Despite this intuition, previous distance-based recommendation methods (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18) do not explicitly take the user-user or item-item relationships into consideration. As a result of relying primarily on user-item information, these systems may fail to generate appropriate user-user or item-item distances. To model the relationships between similar users or items, we employ two margin ranking losses with adaptive margins to encourage similar users or items to be mapped closer together in the latent space. Formally, the similarities between users or items are calculated from the user implicit feedback, which can be represented by a binary user-item interaction matrix. We set a threshold on the calculated similarities to identify the sets of similar users and similar items for a specific user $i$ and item $j$, respectively. We adopt the following losses for user pairs and item pairs, respectively:
(13)  
(14) 
where the negative example is a randomly sampled user in Eq. 13 and a randomly sampled item in Eq. 14. One margin function models the user-user relation and the other models the item-item relation. Their parameters and the model parameters are updated in the same way as in Alg. 1. The inputs to these two margin functions are generated following the procedure described by Eq. 11.
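The neighbor sets used by these pairwise losses can be built directly from the binary interaction data. The sketch below assumes cosine similarity between binary interaction vectors and treats a pair as similar when the similarity reaches the threshold; the inclusive comparison, the function name and the dictionary layout are assumptions made for this example. Applying the same function to the transposed data (item to set of users) yields the item neighbors.

```python
import math


def cosine_neighbors(interactions, threshold):
    """Find similar rows of a binary interaction matrix.

    interactions : dict mapping a user to the set of items she accessed
    threshold    : minimum cosine similarity for two users to be neighbors
    Returns a dict mapping each user to the set of her similar users.
    """
    users = sorted(interactions)
    neighbors = {u: set() for u in users}
    for idx, u in enumerate(users):
        for v in users[idx + 1:]:
            a, b = interactions[u], interactions[v]
            if not a or not b:
                continue  # cosine similarity is undefined for empty rows
            # For binary vectors the dot product is the overlap size and
            # each norm is the square root of the set size.
            sim = len(a & b) / math.sqrt(len(a) * len(b))
            if sim >= threshold:
                neighbors[u].add(v)
                neighbors[v].add(u)
    return neighbors
```

This all-pairs loop is quadratic in the number of users and is only meant to show the computation; a practical implementation would rely on sparse matrix products.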
4.4. Model Training
Let us denote the losses and to capture the interactions between users and items. Then we combine the loss functions presented in Section 4.3 to optimize the proposed model:
(15)  
Here the weighting coefficient in Eq. 15 is a regularization parameter. We follow the training scheme of Section 4.2 to optimize Eq. 15. To mitigate the curse of dimensionality (DBLP:conf/nips/BordesUGWY13) and prevent overfitting, we bound all the user and item embeddings within a unit sphere after each mini-batch update, i.e., the norm of each embedding is constrained to be at most one. When minimizing the objective function, the partial derivatives with respect to all the parameters are computed by backpropagation.
Recommendation Phase. In the testing phase, for a certain user $i$, we compute the distance between user $i$ and each item in the dataset. Then the items that are not in the training set and have the shortest distances are recommended to user $i$.
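The norm constraint and the recommendation step can be sketched as follows. Both functions are illustrative: the names are hypothetical, the projection is assumed to rescale any vector whose Euclidean norm exceeds one, and the user-to-item distances are assumed to be precomputed.

```python
import math


def clip_to_unit_ball(vec):
    """Rescale a vector so that its Euclidean norm is at most one."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= 1.0:
        return list(vec)
    return [x / norm for x in vec]


def recommend(distances, train_items, k):
    """Return the k unseen items with the smallest distance to the user.

    distances   : dict mapping each item to its distance from the user
    train_items : set of items already in the user's training set
    """
    candidates = [(dist, item) for item, dist in distances.items()
                  if item not in train_items]
    candidates.sort()  # ascending distance, ties broken by item identifier
    return [item for _, item in candidates[:k]]
```

For example, with distances `{"a": 0.1, "b": 0.5, "c": 0.3, "d": 0.2}`, training set `{"a"}` and `k = 2`, the closest unseen items are `d` and then `c`.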
5. Experiments
In this section, we evaluate the proposed model, comparing it with state-of-the-art methods on five real-world datasets.
5.1. Datasets
The proposed model is evaluated on five real-world datasets from various domains with different sparsities: Books, Electronics and CDs (DBLP:conf/www/HeM16), Comics (DBLP:conf/recsys/WanM18) and Gowalla (DBLP:conf/kdd/ChoML11). The Books, Electronics and CDs datasets are adopted from the Amazon review dataset with different categories, i.e., books, electronics and CDs. These datasets include a significant amount of user-item interaction data, e.g., user ratings and reviews. The Comics dataset was collected in late 2017 from the GoodReads website, which covers different genres, and we use the comics genre. The Gowalla dataset was collected worldwide from the Gowalla website (a location-based social networking website) over the period from February 2009 to October 2010. In order to be consistent with the implicit feedback setting, we retain any ratings no less than four (out of five) as positive feedback and treat all other ratings as missing entries for all datasets. To filter noisy data, we only include users with at least ten ratings and items with at least five ratings. Table 1 shows the data statistics.
We employ five-fold cross-validation to evaluate the proposed model. For each user, the items she accessed are randomly split into five folds. We pick one fold each time as the ground truth for testing, and the remaining four folds constitute the training set. The average results over the five folds are reported.
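The per-user split can be sketched as below. The function names, the fixed random seed and the round-robin assignment of shuffled items to folds are assumptions made for this illustration.

```python
import random


def five_fold_split(user_items, n_folds=5, seed=0):
    """Randomly split each user's items into n_folds disjoint folds.

    user_items : dict mapping a user to the set of items she accessed
    Returns a list of n_folds dicts, each mapping a user to one fold.
    """
    rng = random.Random(seed)
    folds = [dict() for _ in range(n_folds)]
    for user, items in user_items.items():
        shuffled = sorted(items)  # sort first so the shuffle is reproducible
        rng.shuffle(shuffled)
        for f in range(n_folds):
            # Every n_folds-th item, starting at offset f, goes to fold f.
            folds[f][user] = set(shuffled[f::n_folds])
    return folds


def train_test(folds, test_fold):
    """Use one fold as the test set and merge the others for training."""
    test = folds[test_fold]
    train = {}
    for f, fold in enumerate(folds):
        if f == test_fold:
            continue
        for user, items in fold.items():
            train.setdefault(user, set()).update(items)
    return train, test
```

Calling `train_test(folds, f)` for each `f` in turn yields the five train/test pairs whose results are averaged.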
Dataset  #Users  #Items  #Interactions  Density 
Books  77,754  66,963  2,517,343  0.048% 
Electronics  40,358  28,147  524,906  0.046% 
CDs  24,934  24,634  478,048  0.079% 
Comics  37,633  39,623  2,504,498  0.168% 
Gowalla  64,404  72,871  1,237,869  0.034% 
5.2. Evaluation Metrics
We evaluate all models in terms of Recall@k and NDCG@k. For each user, Recall@k (R@k) indicates the percentage of her rated items that appear in the top-k recommended items. NDCG@k (N@k) is the normalized discounted cumulative gain at k, which takes the position of correctly recommended items into account.
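For concreteness, a minimal sketch of the two metrics for a single user is given below. It assumes binary relevance, a base-2 logarithmic discount, and Recall normalized by the number of test items; these are common conventions rather than details stated above, and the function names are hypothetical.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the user's test items found in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    # A hit at 0-based position pos contributes 1 / log2(pos + 2).
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    # The ideal ranking places all relevant items at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

The reported scores would then be the averages of these per-user values over all test users.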
5.3. Methods Studied
To demonstrate the effectiveness of our model, we compare to the following recommendation methods.
Classical methods for implicit feedback:

BPRMF, Bayesian Personalized Ranking-based Matrix Factorization (DBLP:conf/uai/RendleFGS09), which is a classic method for learning pairwise personalized rankings from user implicit feedback.
Classical neural-based recommendation methods:

NCF, Neural Collaborative Filtering (DBLP:conf/www/HeLZNHC17), which combines the matrix factorization (MF) model with a multilayer perceptron (MLP) to learn the user-item interaction function.

DeepAE, the deep autoencoder (DBLP:conf/cikm/MaZWL18), which utilizes a three-hidden-layer autoencoder with a weighted loss function.
State-of-the-art distance-based recommendation methods:

CML, Collaborative Metric Learning (DBLP:conf/www/HsiehYCLBE17), which learns a metric space to encode the user-item interactions and to implicitly capture the user-user and item-item similarities.

LRML, Latent Relational Metric Learning (DBLP:conf/www/TayTH18), which exploits an attention-based memory-augmented neural architecture to model the relationships between users and items.

TransCF, Collaborative Translational Metric Learning (DBLP:conf/icdm/ParkKXY18), which employs the neighborhood of users and items to construct translation vectors capturing the intensity of user-item relations.

SML, Symmetric Metric Learning with adaptive margin (DBLP:journals/corr/LiZZQZHH20), which measures the trilateral relationship from both the user-centric and item-centric perspectives and learns adaptive margins.
The proposed method:

PMLAM, the proposed model, which represents each user and item as a Gaussian distribution to capture the uncertainties in user preferences and item properties, and incorporates an adaptive margin generation mechanism to generate the margins based on the sampled user-item triplets.
5.4. Experiment Settings
In the experiments, the latent dimension of all the models is set to for a fair comparison. All the models adopt the same negative sampling strategy as the proposed model, unless otherwise specified. For BPRMF, the learning rate is set to and the regularization parameter is set to . With these parameters, the model can achieve good results. For NCF, we follow the same model structure as in the original paper (DBLP:conf/www/HeLZNHC17). The learning rate is set to and the batch size is set to . For DeepAE, we adopt the same model structure employed in the author-provided code and set the batch size to . The weight of the positive items is selected from by a grid search and the weights of all other items are set to as recommended in (DBLP:conf/icdm/HuKV08). For CML, we use the authors' implementation and set the margin to and the regularization parameter to . For LRML, the learning rate is set to , and the number of memories is selected from by a grid search. For TransCF, we follow the settings in the original paper to select and set the margin to and batch size to , respectively. For SML, we follow the authors' code to set the user and item margin bound to , to and to , respectively.
For our model, both the learning rate and are set to . For the and datasets, we randomly sample unobserved users or items as negative samples for each user and positive item. This number is reduced to for the other datasets to speed up the training process. The batch size is set to for all datasets. The dimension is set to . The user and item embeddings are initialized by drawing each vector element independently from a zero-mean Gaussian distribution with a standard deviation of . Our experiments are conducted with PyTorch running on GPU machines (Nvidia Tesla P100).
Dataset  BPRMF  NCF  DeepAE  CML  LRML  TransCF  SML  PMLAM  Improv.
Recall@10
Books  0.0553  0.0568  0.0817  0.0730  0.0565  0.0754  0.0581  0.0885**  8.32%
Electronics  0.0243  0.0277  0.0253  0.0395  0.0299  0.0353  0.0279  0.0469***  18.73%
CDs  0.0730  0.0759  0.0736  0.0922  0.0822  0.0851  0.0793  0.1129***  22.45%
Comics  0.1966  0.2092  0.2324  0.1934  0.1795  0.1967  0.1713  0.2417  4.00%
Gowalla  0.0888  0.0895  0.1113  0.0840  0.0935  0.0824  0.0894  0.1331***  19.58%
NDCG@10
Books  0.0391  0.0404  0.0590  0.0519  0.0383  0.0542  0.0415  0.0671**  13.72%
Electronics  0.0111  0.0125  0.0134  0.0178  0.0117  0.0148  0.0105  0.0234***  31.46%
CDs  0.0383  0.0402  0.0411  0.0502  0.0420  0.0461  0.0423  0.0619***  23.30%
Comics  0.2247  0.2395  0.2595  0.2239  0.1922  0.2341  0.1834  0.2753*  6.08%
Gowalla  0.0806  0.0822  0.0944  0.0611  0.0670  0.0611  0.0823  0.0984*  4.23%
Table 2. Performance comparison of all methods in terms of Recall@10 and NDCG@10. The asterisks mark the statistical significance of the improvement compared to the best baseline method, based on the paired t-test. Improv. denotes the improvement of our model over the best baseline method.
5.5. Implementation Details
To speed up the training process, we implement a two-phase sampling strategy. Every 20 epochs, we sample a number of negative candidates, e.g., 500, for each user to form a candidate set. During the next 20 epochs, the negative samples of each user are sampled from her candidate set. This strategy can be implemented using multiple processes to further reduce the training time.
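A single-process sketch of this two-phase strategy is shown below. The function names, the generator interface and the use of sampling without replacement for the candidate set are assumptions made for the example; the multi-process variant is not shown.

```python
import random


def build_candidate_sets(user_items, all_items, n_candidates, rng):
    """Phase 1: draw a pool of unobserved items for every user."""
    candidates = {}
    for user, seen in user_items.items():
        pool = [item for item in all_items if item not in seen]
        candidates[user] = rng.sample(pool, min(n_candidates, len(pool)))
    return candidates


def sample_negatives(candidates, user, n_neg, rng):
    """Phase 2: draw negatives from the user's cached candidate set."""
    # Assumes the candidate set is non-empty, i.e. the user has not
    # interacted with every item.
    return [rng.choice(candidates[user]) for _ in range(n_neg)]


def negatives_per_epoch(user_items, all_items, n_epochs, refresh_every=20,
                        n_candidates=500, n_neg=1, seed=0):
    """Yield (epoch, negatives) pairs, refreshing candidates periodically."""
    rng = random.Random(seed)
    candidates = {}
    for epoch in range(n_epochs):
        if epoch % refresh_every == 0:
            candidates = build_candidate_sets(
                user_items, all_items, n_candidates, rng)
        yield epoch, {user: sample_negatives(candidates, user, n_neg, rng)
                      for user in user_items}
```

The expensive filtering against each user's observed items happens only once per refresh, while the per-epoch sampling reduces to cheap draws from a short list.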
Since none of the processed datasets has inherent user-user/item-item information, we treat the user-item interactions as a user-item matrix and compute the cosine similarity for the user and item pairs, respectively (DBLP:conf/www/SarwarKKR01). We set a dataset-specific similarity threshold (one value for the Amazon and Gowalla datasets and another for the Comics dataset) to select the neighbors. These thresholds are chosen to ensure a reasonable degree of connectivity in the constructed graphs.
5.6. Performance Comparison
The performance comparison is shown in Figure 3 and Table 2. Based on these results, we have several observations.
Observations about our model. First, the proposed model, PMLAM, achieves the best performance on all five datasets with both evaluation metrics, which illustrates the superiority of our model.
Second, PMLAM outperforms SML. Although SML has an adaptive margin mechanism, it is achieved by having a learnable scalar margin for each user and item and adding a regularization term to prevent the learned margins from being too small. It can be challenging to identify an appropriate regularization weight via hyperparameter tuning. By contrast, PMLAM formulates the adaptive margin generation as a bilevel optimization, avoiding the additional regularization. PMLAM employs a neural network to generate the adaptive margin, so the number of parameters related to margin generation does not increase with the number of users or items.
Third, PMLAM achieves better performance than TransCF. One major reason is that TransCF only considers the items rated by a user and the users who rated an item as the neighbors of the user and item, respectively, which neglects the user-user/item-item relations. PMLAM models the user-user/item-item relations by two margin ranking losses with adaptive margins.
Fourth, PMLAM makes better recommendations than CML and LRML. These methods apply a fixed margin for all user-item triplets and do not measure or model the uncertainty of learned user/item embeddings. PMLAM represents each user and item as a Gaussian distribution, where the uncertainties of learned user preferences and item properties are captured by the covariance matrices.
Fifth, PMLAM outperforms NCF and DeepAE. These are MLP-based recommendation methods with the ability to capture non-linear user-item relationships, but they violate the triangle inequality when modeling user-item interaction. As a result, they can struggle to capture the fine-grained user preference for particular items (DBLP:conf/www/HsiehYCLBE17).
Other observations. First, all of the results reported for the Comics dataset are considerably better than those for the other datasets. The other four datasets are sparser, and data sparsity negatively impacts recommendation performance.
Second, CML, LRML and TransCF perform better than SML on most of the datasets. The adaptive margin regularization term in SML struggles to adequately counterbalance SML's tendency to reduce the loss by imposing small margins. Although it is reported in (DBLP:journals/corr/LiZZQZHH20) that SML outperforms CML, LRML and TransCF, those experiments are conducted on three relatively small-scale datasets with only several thousand users and items. We experiment with much larger datasets; identifying a successful regularization setting appears to be more difficult as the number of users increases.
Third, TransCF outperforms LRML on most of the datasets. One possible reason is that TransCF has a more effective translation embedding learning mechanism, which incorporates the neighborhood information of users and items. TransCF also has a regularization term to further pull positive items closer to the anchor user.
Fourth, CML achieves better performance than LRML on most of the datasets. CML integrates the weighted approximate-rank pairwise (WARP) weighting scheme (DBLP:journals/ml/WestonBU10) in the loss function to penalize lower-ranked positive items. The comparison between CML and LRML in (DBLP:conf/www/TayTH18) removes this component of CML. The WARP scheme appears to play an important role in improving CML's performance.
Fifth, DeepAE outperforms NCF. The heuristic weighting function of DeepAE can impose useful penalties on errors that occur during training when positive items are assigned lower prediction scores.
Architecture  CDs R@10  CDs N@10  Electronics R@10  Electronics N@10
(1) + Deter_Emb  0.0721  0.0371  0.0241  0.0090
(2) + Gauss_Emb  0.0815  0.0434  0.0296  0.0110
(3) + Deter_Emb  0.0777  0.0415  0.0338  0.0125
(4)  + Deter_Emb  0.0408  0.0204  0.0139  0.0055
(5)  + Deter_Emb  0.0311  0.0158  0.0050  0.0018
(6) + Gauss_Emb  0.0856  0.0454  0.0365  0.0155
(7) + +  0.0966  0.0526  0.0429  0.0189
(8) PMLAM  0.1129  0.0619  0.0469  0.0234
5.7. Ablation Analysis
To assess the relative effectiveness of the proposed user-item interaction module, the adaptive margin generation module, and the user-user/item-item relation module, we conduct an ablation study. Table 3 reports the performance improvement achieved by each module of the proposed model. Note that we compute Euclidean distances between deterministic embeddings. In (1), which serves as a baseline, we use the hinge loss with a fixed margin (Eq. 4) on deterministic embeddings of users and items to capture the user-item interaction (the margin is set to the value commonly used in (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18; DBLP:conf/icdm/ParkKXY18)). In (2), as an alternative baseline, we apply the same hinge loss as in (1), but replace the deterministic embeddings with parameterized Gaussian distributions (Section 4.1). In (3), we use the adaptive margin generation module (Section 4.2) to generate the margins for deterministic embeddings. In (4), we concatenate the deterministic embeddings to form the input of the margin generation module instead of using Eq. 11. In (5), we sum the deterministic embeddings to form this input instead of using Eq. 11. In (6), we combine (2) and (3) to generate the adaptive margins for Gaussian embeddings. In (7), we augment (6) with user-user/item-item modeling but with a fixed margin, set to the same fixed value. In (8), we add the user-user/item-item modeling with adaptive margins (Section 4.3) to replace the fixed margins in the configuration of (7).
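To make the difference between configurations (1) and (3) concrete, the following sketch computes a triplet hinge loss on deterministic embeddings, once with a single fixed margin and once with one margin per triplet. It is an illustration under stated assumptions, not the paper's code: the loss is written in the standard triplet form max(0, d(u, i) - d(u, j) + m), which may differ in detail from Eq. 4, and the embeddings and margin values below are hypothetical. In the model itself, the per-triplet margins would be produced by the margin generation module rather than supplied by hand.

```python
import math


def hinge_loss(user, pos_item, neg_item, margin):
    """Triplet hinge loss on Euclidean distances between embeddings."""
    d_pos = math.dist(user, pos_item)   # distance to the accessed item
    d_neg = math.dist(user, neg_item)   # distance to the sampled negative
    return max(0.0, d_pos - d_neg + margin)


def batch_hinge_loss(triplets, margins):
    """Sum the loss over triplets; `margins` holds one margin per triplet."""
    return sum(hinge_loss(u, i, j, m) for (u, i, j), m in zip(triplets, margins))


# Hypothetical 2-D embeddings: (user, positive item, negative item).
triplets = [
    ([0.0, 0.0], [1.0, 0.0], [3.0, 0.0]),   # negative already far away
    ([0.0, 0.0], [1.0, 0.0], [1.5, 0.0]),   # negative close to the positive
]

fixed = batch_hinge_loss(triplets, [1.0, 1.0])       # one margin for all triplets
adaptive = batch_hinge_loss(triplets, [3.0, 0.25])   # one margin per triplet
```

With the fixed margin, only the second triplet is penalized; with the per-triplet margins, the penalty shifts to the first triplet, whose larger margin requires its negative to be pushed farther from the user.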
From the results in Table 3, we have several observations. First, from (1) and (2), we observe that by representing the user and item as Gaussian distributions and computing the distance between Gaussian distributions, the performance improves. This suggests that measuring the uncertainties of learned embeddings is important. Second, from (1) and (3) along with (2) and (6), we observe that incorporating the adaptive margin generation module improves performance, irrespective of whether deterministic or Gaussian embeddings are used. These results demonstrate the effectiveness of the proposed margin generation module. Third, from (3), (4) and (5), we observe that our designed inputs (Eq. 11) for margin generation facilitate the production of appropriate margins compared to commonly used embedding concatenation or summation operations. Fourth, from (2), (3) and (6), we observe that (6) achieves better results than either (2) or (3), demonstrating that Gaussian embeddings and adaptive margin generation are compatible and can be combined to improve the model performance. Fifth, compared to (6), we observe that the inclusion of the user-user and item-item terms in the objective function (7) leads to a large improvement in recommendation performance. This demonstrates that explicit user-user/item-item modeling is essential and can be an effective supplement for inferring user preferences. Sixth, from (7) and (8), we observe that adaptive margins also improve the modeling of the user-user/item-item relations.
User  Positive  Sampled Movie  Margin
405  Scream (Thriller)  Four Rooms (Thriller)  1.2752
405  Scream (Thriller)  Toy Story (Animation)  12.8004
405  French Kiss (Comedy)  Addicted to Love (Comedy)  2.6448
405  French Kiss (Comedy)  Batman (Action)  12.4607
66  Air Force One (Action)  GoldenEye (Action)  0.3216
66  Air Force One (Action)  Crumb (Documentary)  5.0010
66  The Godfather (Crime)  The Godfather II (Crime)  0.0067
66  The Godfather (Crime)  Terminator (Sci-Fi)  3.6335
5.8. Case Study
In this section, we conduct case studies to examine whether the adaptive margin generation can produce appropriate margins. For this purpose, we train our model on the MovieLens-100K dataset. This dataset provides richer side information about movies (e.g., movie genres), making it easier for us to illustrate the results. Since we only focus on the adaptive margin generation, we use deterministic embeddings of users and items to avoid the interference of other modules. We randomly sample users from the dataset. For each user, we sample one item that the user has accessed as the positive item and two items the user did not access as negative items, where one item has a similar genre to the positive item and the other does not. The case study results are shown in Table 4.
As shown in Table 4, our adaptive margin generation module tends to generate a smaller margin value when the negative movie has a similar genre to the positive movie, while generating larger margins when they are distinct. The generated margins thus encourage the model to embed items with a higher probability of being preferred closer to the user's embedding.
6. Conclusion
In this paper, we propose a distance-based recommendation model for top-K recommendation. Each user and item in our model is represented by a Gaussian distribution with learnable parameters to handle the uncertainties. By incorporating an adaptive margin scheme, our model can generate fine-grained margins for the training triples during the training procedure. To explicitly capture the user-user/item-item relations, we adopt two margin ranking losses with adaptive margins to force similar user and item pairs to map closer together in the latent space. Experimental results on five real-world datasets validate the performance of our model, demonstrating improved performance compared to many state-of-the-art methods and highlighting the effectiveness of the Gaussian embeddings and the adaptive margin generation scheme. The code is available at https://github.com/huaweinoah/noahresearch/tree/master/PMLAM.