Probabilistic Metric Learning with Adaptive Margin for Top-K Recommendation

Personalized recommender systems are playing an increasingly important role as more content and services become available and users struggle to identify what might interest them. Although matrix factorization and deep learning based methods have proved effective in user preference modeling, they violate the triangle inequality and fail to capture fine-grained preference information. To tackle this, we develop a distance-based recommendation model with several novel aspects: (i) each user and item is parameterized by a Gaussian distribution to capture the learning uncertainties; (ii) an adaptive margin generation scheme is proposed to generate margins for different training triplets; (iii) explicit user-user/item-item similarity modeling is incorporated in the objective function. The Wasserstein distance is employed to determine preferences because it obeys the triangle inequality and can measure the distance between probability distributions. In a comparison with state-of-the-art methods on five real-world datasets, the proposed model outperforms the best existing models by 4-22% in terms of recall for Top-K recommendation.



1. Introduction

Internet users can easily access an increasingly vast number of online products and services, and it is becoming very difficult for users to identify the items that will appeal to them out of a plethora of candidates. To reduce information overload and to satisfy the diverse needs of users, personalized recommender systems have emerged and they are beginning to play an important role in modern society. These systems can provide personalized experiences, serve huge service demands, and benefit both the user-side and supply-side. They can: (i) help users easily discover products that are likely to interest them; and (ii) create opportunities for product and service providers to better serve customers and to increase revenue.

In all kinds of recommender systems, modeling the user-item interaction lies at the core. There are two common ways used in recent recommendation models to infer the user preference: matrix factorization (MF) and multi-layer perceptrons (MLPs). MF-based methods (e.g., (DBLP:conf/icdm/HuKV08; DBLP:conf/uai/RendleFGS09)) apply the inner product between latent factors of users and items to predict the user preferences for different items. The latent factors strive to depict the user-item relationships in the latent space. In contrast, MLP-based methods (e.g., (DBLP:conf/www/HeLZNHC17; DBLP:conf/recsys/CovingtonAS16)) adopt (deep) neural networks to learn non-linear user-item relationships, which can generate better latent feature combinations between the embeddings of users and items.


However, both MF-based and MLP-based methods violate the triangle inequality (DBLP:conf/nips/Shrivastava014), and as a result may fail to capture the fine-grained preference information (DBLP:conf/www/HsiehYCLBE17). As a concrete example in (DBLP:conf/icdm/ParkKXY18), if a user accessed two items, MF or MLP-based methods will put both items close to the user, but will not necessarily put these two items close to each other, even if they share similar properties.

To address the limitations of MF and MLP-based methods, metric (distance) learning approaches have been utilized in the recommendation model (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/icdm/ParkKXY18; DBLP:conf/www/TayTH18; DBLP:journals/corr/LiZZQZHH20), as the distance naturally satisfies the triangle inequality. These techniques project users and items into a low-dimensional metric space, where the user preference is measured by the distance to items. Specifically, CML (DBLP:conf/www/HsiehYCLBE17) and LRML (DBLP:conf/www/TayTH18) are two representative models. CML minimizes the Euclidean distance between users and their accessed items, which facilitates user-user/item-item similarity learning. LRML incorporates a memory network to introduce additional capacity to learn relations between users and items in the metric space.

Figure 1. A motivating example of handling uncertainties of learned embeddings.

Although existing distance-based methods have achieved satisfactory results, we argue that there are still several avenues for enhancing performance. First, previous distance-based methods (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/icdm/ParkKXY18; DBLP:conf/www/TayTH18; DBLP:journals/corr/LiZZQZHH20) learn the user and item embeddings in a deterministic manner without handling the uncertainty. Relying solely on the learned deterministic embeddings may lead to an inaccurate understanding of user preferences. A motivating example is shown in Figure 1. After a user has accessed two songs with different genres, her embedding may be placed between those two songs. If we only consider deterministic embeddings, a nearby song from an unrelated genre may appear to be a good candidate. But if we consider the embeddings from a probabilistic perspective, a song that shares a genre with one of the accessed songs can be a better recommendation, even if its mean embedding is slightly farther away. Second, most of the existing methods (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18; DBLP:conf/icdm/ParkKXY18) adopt the margin ranking loss (hinge loss) with a fixed margin as a hyper-parameter. We argue that the margin value should be adaptive and relevant to the corresponding training samples. Furthermore, different training phases may need different magnitudes of margin values, so setting a fixed value may not be optimal. Third, previous distance-based methods (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18; DBLP:conf/icdm/ParkKXY18; DBLP:journals/corr/LiZZQZHH20) do not explicitly model user-user and item-item relationships. Closely-related users are very likely to share the same interests, and if two items have similar attributes it is likely that a user will favour both. When inferring a user's preferences, we should explicitly take into account the user-user and item-item similarities.

To address the shortcomings, we propose a Probabilistic Metric Learning model with an Adaptive Margin (PMLAM) for Top-K recommendation. PMLAM consists of three major components: 1) a user-item interaction module, 2) an adaptive margin generation module, and 3) a user-user/item-item relation modeling module. To capture the uncertainties in the learned user and item embeddings, each user or item is parameterized with one Gaussian distribution, where the distribution related parameters are learned by our model. In the user-item interaction module, we adopt the Wasserstein distance to measure the distances between users and items, thus not only taking into account the means but also the uncertainties. In the adaptive margin generation module, we model the learning of adaptive margins as a bilevel (inner and outer) optimization problem (DBLP:journals/tec/SinhaMD18), where we build a proxy function to explicitly link the learning of margin related parameters with the outer objective function. In the user-user and item-item relation modeling module, we incorporate two margin ranking losses with adaptive margins for user-pairs and item-pairs, respectively, to explicitly encourage similar users or items to be mapped closer to one another in the latent space. We extensively evaluate our model by comparing with many state-of-the-art methods, using two performance metrics on five real-world datasets. The experimental results not only demonstrate the improvements of our model over other baselines but also show the effectiveness of the proposed modules.

To summarize, the major contributions of this paper are:


  • To capture the uncertainties in the learned user/item embeddings, we represent each user and item as a Gaussian distribution. The Wasserstein distance is leveraged to measure the user preference for items while simultaneously considering the uncertainty.

  • To generate an adaptive margin, we cast margin generation as a bilevel optimization problem, where a proxy function is built to explicitly update the margin generation related parameters.

  • To explicitly model the user-user and item-item relationships, we apply two margin ranking losses with adaptive margins to force similar users and items to map closer to one another in the latent space.

  • Experiments on five real-world datasets show that the proposed PMLAM model significantly outperforms the state-of-the-art methods for the Top-K recommendation task.

2. Related Work

In this section we summarize and discuss work that is related to our proposed top-K recommendation model.

In many real-world recommendation scenarios, user implicit data (DBLP:conf/kdd/WangWY15; DBLP:conf/kdd/MaKL19), e.g., clicking history, is more common than explicit feedback (DBLP:conf/icml/SalakhutdinovMH07) such as user ratings. The implicit feedback setting, also called one-class collaborative filtering (OCCF) (DBLP:conf/icdm/PanZCLLSY08), arises when only positive samples are available. To tackle this challenging problem, effective methods have been proposed.

Matrix Factorization-based Methods. Popularized by the Netflix prize competition, matrix factorization (MF) based methods have become a prominent solution for personalized recommendation (DBLP:journals/computer/KorenBV09). In (DBLP:conf/icdm/HuKV08), Hu et al. propose a weighted regularized matrix factorization (WRMF) model that treats all the missing data as negative samples, while heuristically assigning confidence weights to positive samples. Rendle et al. adopt a different approach in (DBLP:conf/uai/RendleFGS09), proposing a pair-wise ranking objective (Bayesian personalized ranking) to model the pair-wise relationships between positive items and negative items for each user, where the negative samples are randomly sampled from the unobserved feedback. To allow unobserved items to have varying degrees of importance, He et al. in (DBLP:conf/sigir/HeZKC16) propose to weight the missing data based on item popularity, demonstrating improved performance compared to WRMF.

Multi-layer Perceptron-based Methods. Due to their ability to learn more complex non-linear relationships between users and items, (deep) neural networks have been a great success in the domain of recommender systems. He et al. in (DBLP:conf/www/HeLZNHC17) propose a neural network-based collaborative filtering model, where a multi-layer perceptron is used to learn the non-linear user-item interactions. In (DBLP:conf/wsdm/WuDZE16; DBLP:conf/cikm/MaZWL18; DBLP:conf/wsdm/MaKWWL19), (denoising) autoencoders are employed to learn the user or item hidden representations from user implicit feedback. Autoencoder approaches can be shown to be generalizations of many of the MF methods (DBLP:conf/wsdm/WuDZE16). In (DBLP:conf/ijcai/XueDZHC17; DBLP:conf/ijcai/GuoTYLH17), conventional matrix factorization and factorization machine methods benefit from the representation ability of deep neural networks for learning either the user-item relationships or the interactions with side information. Graph neural networks (GNNs) have recently been incorporated in recommendation algorithms because they can learn and model relationships between entities (DBLP:conf/sigir/Wang0WFC19; DBLP:conf/icdm/SunZMCGTH19; DBLP:conf/aaai/MaMZSLC20).

Distance-based Methods. Due to their capacity to measure the distance between users and items, distance-based methods have been successfully applied in Top-K recommendation. In (DBLP:conf/www/HsiehYCLBE17), Hsieh et al. propose to compute the Euclidean distance between users and items for capturing fine-grained user preference. In (DBLP:conf/www/TayTH18), Tay et al. adopt a memory network (DBLP:conf/nips/SukhbaatarSWF15) to explicitly store the user preference in external memories. Park et al. in (DBLP:conf/icdm/ParkKXY18) apply a translation embedding to capture more complex relations between users and items, where the translation embedding is learned from the neighborhood information of users and items. In (DBLP:conf/recsys/HeKM17), He et al. apply a distance metric to capture how the user interest shifts across sequential user-item interactions. In (DBLP:journals/corr/LiZZQZHH20), Li et al. propose to measure the trilateral relationship from both the user-centric and item-centric perspectives and learn adaptive margins for the central user and positive item.

Our proposed recommendation model is different in key ways from all of the methods identified above. In contrast to the matrix factorization (DBLP:conf/icdm/HuKV08; DBLP:conf/uai/RendleFGS09; DBLP:conf/sigir/HeZKC16) and neural network methods (DBLP:conf/www/HeLZNHC17; DBLP:conf/wsdm/WuDZE16; DBLP:conf/cikm/MaZWL18; DBLP:conf/wsdm/MaKWWL19; DBLP:conf/ijcai/XueDZHC17; DBLP:conf/ijcai/GuoTYLH17; DBLP:conf/sigir/Wang0WFC19), we employ the Wasserstein distance, which obeys the triangle inequality. This is important for ensuring that users with similar interaction histories are mapped close together in the latent space. In contrast to most of the prior distance-based approaches (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18; DBLP:conf/icdm/ParkKXY18; DBLP:journals/corr/LiZZQZHH20), we employ parameterized Gaussian distributions to represent each user and item in order to capture the uncertainties of learned user preferences and item properties. Moreover, we formulate a bilevel optimization problem and incorporate a neural network to generate adaptive margins for the commonly applied margin ranking loss function.

3. Problem Formulation

The recommendation task considered in this paper takes as input the user implicit feedback. For each user $u$, the preference data is represented by a set $\mathcal{D}_u$ that includes the items she preferred, e.g., $\mathcal{D}_u = \{i_1, \dots, i_n\}$, where $i_*$ denotes an item index in the dataset. The top-$K$ recommendation task in this paper is formulated as follows: given the training item set $\mathcal{S}_u$ and the non-empty test item set $\mathcal{T}_u$ (requiring that $\mathcal{S}_u \cup \mathcal{T}_u = \mathcal{D}_u$ and $\mathcal{S}_u \cap \mathcal{T}_u = \emptyset$) of user $u$, the model must recommend an ordered set of items $\mathcal{X}_u$ such that $|\mathcal{X}_u| \le K$ and $\mathcal{X}_u \cap \mathcal{S}_u = \emptyset$. The recommendation quality is then evaluated by a matching score between $\mathcal{T}_u$ and $\mathcal{X}_u$, such as Recall@$K$.

4. Methodology

In this section, we present the proposed model shown in Fig. 2. We first introduce the user-item interaction module, which captures the user-item interactions by calculating the Wasserstein distance between users’ and items’ distributions. Then we describe the adaptive margin generation module, which generates adaptive margins during the training process. Next, we present the user-user and item-item relation modeling module. Lastly, we specify the objective function and explain the training process of the proposed model.

Figure 2. An overview of the proposed model. $\mathcal{L}_{\Theta} + \mathcal{L}_{\Phi}$ denotes the combined optimization of the model parameters $\Theta$ and the margin-generation parameters $\Phi$; the user-user and item-item modules follow the same manner as the user-item module.

4.1. Wasserstein Distance for Interactions

Previous works (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18) use the user and item embeddings in a deterministic manner and do not measure or learn the uncertainties of user preferences and item properties. Motivated by probabilistic matrix factorization (PMF) (DBLP:conf/nips/SalakhutdinovM07), we represent each user or item as a single Gaussian distribution. In contrast to PMF, which applies Gaussian priors on user and item embeddings, users and items in our model are parameterized by Gaussian distributions, where the means and covariances are directly learned. Specifically, the latent factors of user $u$ and item $i$ are represented as:

$$ \mathbf{h}_u \sim \mathcal{N}(\mu_u, \Sigma_u), \qquad \mathbf{h}_i \sim \mathcal{N}(\mu_i, \Sigma_i) \quad (1) $$

Here $\mu_u$ and $\Sigma_u$ are the learned mean vector and covariance matrix of user $u$, respectively; $\mu_i$ and $\Sigma_i$ are the learned mean vector and covariance matrix of item $i$. To limit the complexity of the model and reduce the computational overhead, we assume that the embedding dimensions are uncorrelated. Thus, $\Sigma$ is a diagonal covariance matrix that can be represented as a vector. Specifically, $\mu \in \mathbb{R}^h$ and $\mathrm{diag}(\Sigma) \in \mathbb{R}^h$, where $h$ is the dimension of the latent space.

Widely used distance metrics for deterministic embeddings, like the Euclidean distance, do not properly measure the distance between distributions. Since users and items are represented by probability distributions, we need a distance measure between distributions. Among the commonly used distance metrics between distributions, we adopt the Wasserstein distance to measure the user preference for an item. The reasons are twofold: i) the Wasserstein distance satisfies all the properties a distance should have; and ii) the Wasserstein distance has a simple form when calculating the distance between Gaussian distributions (DBLP:conf/nips/MallastoF17). Formally, the $p$-th Wasserstein distance between two probability measures $\mu$ and $\nu$ on a Polish metric space $(\mathcal{X}, d)$ (srivastava2008course) is defined (Givens_1984):

$$ W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p} $$

where $d(\cdot, \cdot)$ is an arbitrary distance with $p$-th moment (casella2002statistical) for a deterministic variable, $p \in [1, \infty)$; and $\Gamma(\mu, \nu)$ denotes the set of all measures on $\mathcal{X} \times \mathcal{X}$ which admit $\mu$ and $\nu$ as marginals. When $p \ge 1$, the $p$-th Wasserstein distance preserves all properties of a metric, including both symmetry and the triangle inequality.

The calculation of the general Wasserstein distance is computation-intensive (DBLP:conf/uai/XieWWZ19). To reduce the computational cost, we use Gaussian distributions for the latent representations of users and items. Then for $p = 2$, the 2-nd Wasserstein distance (abbreviated as $W_2$) has a closed-form solution, which makes the calculation much faster. Specifically, we have the following formula to calculate the $W_2$ distance between user $u$ and item $i$ (Givens_1984):

$$ W_2\big(\mathcal{N}(\mu_u, \Sigma_u), \mathcal{N}(\mu_i, \Sigma_i)\big)^2 = \|\mu_u - \mu_i\|_2^2 + \mathrm{trace}\Big(\Sigma_u + \Sigma_i - 2\big(\Sigma_i^{1/2} \Sigma_u \Sigma_i^{1/2}\big)^{1/2}\Big) \quad (2) $$

In our setting, we focus on diagonal covariance matrices, thus $\Sigma_u \Sigma_i = \Sigma_i \Sigma_u$. For simplicity, we use $W_2(u, i)$ to denote the left-hand side of Eq. 2. Then Eq. 2 can be simplified as:

$$ W_2(u, i)^2 = \|\mu_u - \mu_i\|_2^2 + \|\Sigma_u^{1/2} - \Sigma_i^{1/2}\|_F^2 \quad (3) $$

According to Eq. 3, the time complexity of calculating the $W_2$ distance between the latent representations of users and items is linear in the embedding dimension.
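The simplified closed form can be sketched as follows: a minimal NumPy function for the squared $W_2$ distance between two diagonal Gaussians, with the diagonal covariances stored as vectors of variances (function and argument names are our own):

```python
import numpy as np

def w2_distance_sq(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between diagonal Gaussians.

    mu1, mu2   : mean vectors
    var1, var2 : diagonal covariance entries (variances) as vectors
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    # For commuting (diagonal) covariances, the trace term collapses to the
    # squared Frobenius norm of the difference of the matrix square roots.
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return mean_term + cov_term
```

Both terms are simple sums over the embedding dimension, which is where the linear time complexity comes from.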

4.2. Adaptive Margin in Margin Ranking Loss

To learn the distance-based model, most of the existing works (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18) apply the margin ranking loss to measure the user preference difference between positive items and negative items. Specifically, the margin ranking loss ensures that the distance between a user and a positive item is less than the distance between the user and a negative item by a fixed margin $m > 0$. The loss function is:

$$ \mathcal{L}(u, i, j) = \sum_{(u,i) \in \mathcal{S}} \sum_{(u,j) \notin \mathcal{S}} \big[ W_2(u, i)^2 - W_2(u, j)^2 + m \big]_+ \quad (4) $$

where $\mathcal{S}$ denotes the set of observed user-item interactions, $i$ is an item that user $u$ has accessed, $j$ is a randomly sampled item treated as the negative example, and $[z]_+ = \max(z, 0)$. Thus, $(u, i, j)$ represents a training triplet.
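Per triplet, the hinge term above reduces to a one-liner; a sketch (assuming the squared $W_2$ distances are passed in directly):

```python
def margin_ranking_loss(d_pos_sq, d_neg_sq, margin):
    """Hinge loss for one triplet (u, i, j): the positive item i must be
    closer to the user than the negative item j by at least `margin`."""
    return max(d_pos_sq - d_neg_sq + margin, 0.0)
```

The loss is zero once the positive item is sufficiently closer than the negative one, so only violating triplets contribute gradients.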

The safe margin in the margin ranking loss is a crucial hyper-parameter that has a major impact on the model performance. A fixed margin value may not achieve satisfactory performance. First, a fixed value does not adapt to the distinct properties of the training triplets. For example, some users have broad interests, so the margins for these users should not be so large as to make potentially preferred items too far from the user. Other users have very focused interests, and it is desirable to have a larger margin to avoid recommending items that are not directly within the focus. Second, in different training phases, the model may need different magnitudes of margins. For instance, in the early stage of training, the model is not reliable enough to make strong predictions on user preferences, and thus imposing a large margin risks pushing potentially positive items too far from a user. Third, to achieve satisfactory performance, the selection of a fixed margin involves tedious hyper-parameter tuning. Based on these considerations, we conclude that setting a fixed margin value for all training triplets may limit the model expressiveness.

To address the problems outlined above, we propose an adaptive margin generation scheme which generates margins according to the training triplets. Formally, we formulate the margin ranking loss with an adaptive margin as:

$$ \mathcal{L}_{\Theta}(u, i, j \mid \Phi) = \sum_{(u,i) \in \mathcal{S}} \sum_{(u,j) \notin \mathcal{S}} \big[ W_2(u, i)^2 - W_2(u, j)^2 + f(s_{(u,i,j)}; \Phi) \big]_+ \quad (5) $$

Here $f(\cdot)$ is a function that generates the specific margin based on the corresponding user and item embeddings, $s_{(u,i,j)}$ is its input constructed from the triplet, and $\Phi$ is the learnable set of parameters associated with $f(\cdot)$. Then we could consider optimizing $\Theta$ and $\Phi$ simultaneously:

$$ \min_{\Theta, \Phi} \; \mathcal{L}_{\Theta}(u, i, j \mid \Phi) \quad (6) $$
Unfortunately, directly minimizing the objective function as in Eq. 6 does not achieve the desired purpose of generating suitable adaptive margins. Since the margin-related term explicitly appears in the loss function, constantly decreasing the value of the generated margin is the most straightforward way to reduce the loss. As a result, all generated margins take very small values or are set to zero, leading to unsatisfactory results. In other words, the direct optimization of Eq. 6 with respect to $\Phi$ harms the optimization of $\Theta$.

4.2.1. Bilevel Optimization

We model the learning of recommendation models and the generation of adaptive margins as a bilevel optimization problem (DBLP:journals/anor/ColsonMS07):

$$ \min_{\Phi} \; \mathcal{L}_{\Phi}\big(\Theta^{*}(\Phi)\big) \quad \text{s.t.} \quad \Theta^{*}(\Phi) = \operatorname*{arg\,min}_{\Theta} \; \mathcal{L}_{\Theta}(\Theta \mid \Phi) \quad (7) $$

Here $\Theta$ contains the model parameters $\mu$ and $\Sigma$. The inner objective function $\mathcal{L}_{\Theta}$ is minimized with respect to $\Theta$, while the outer objective function $\mathcal{L}_{\Phi}$ is optimized with respect to $\Phi$ through $\Theta^{*}(\Phi)$. For simplicity, the margin term in $\mathcal{L}_{\Phi}$ is set to $0$ to guide the learning of $\Phi$. Thus, we can apply an alternating optimization to learn $\Theta$ and $\Phi$:

  • $\Theta$ update phase (Inner Optimization): Fix $\Phi$ and optimize $\Theta$.

  • $\Phi$ update phase (Outer Optimization): Fix $\Theta$ and optimize $\Phi$.

4.2.2. Approximate Gradient Optimization

As most existing models utilize gradient-based methods for optimization, a simple approximation strategy with less computation is introduced as follows:

$$ \Theta^{*}(\Phi) \approx \Theta - \eta \, \nabla_{\Theta} \mathcal{L}_{\Theta}(\Theta \mid \Phi) \quad (8) $$

In this expression, $\Theta$ denotes the current parameters, including $\mu$ and $\Sigma$, and $\eta$ is the learning rate for one step of inner optimization. Related approximations have been validated in (DBLP:conf/wsdm/Rendle12; DBLP:conf/iclr/LiuSY19). Thus, we can define a proxy function to link $\Phi$ with the outer optimization:

$$ \mathcal{L}_{\Phi} = \mathcal{L}\big(\Theta - \eta \, \nabla_{\Theta} \mathcal{L}_{\Theta}(\Theta \mid \Phi)\big) \quad (9) $$

For simplicity, we use two optimizers $\mathrm{Opt}_{\Theta}$ and $\mathrm{Opt}_{\Phi}$ to update $\Theta$ and $\Phi$, respectively. The iterative procedure is shown in Alg. 1.

Initialize optimizers $\mathrm{Opt}_{\Theta}$ and $\mathrm{Opt}_{\Phi}$;
while not converged do
       Update $\Theta$ (fix $\Phi$): $\Theta \leftarrow \mathrm{Opt}_{\Theta}\big(\nabla_{\Theta} \mathcal{L}_{\Theta}(\Theta \mid \Phi)\big)$;
       Update $\Phi$ (fix $\Theta$): $\Phi \leftarrow \mathrm{Opt}_{\Phi}\big(\nabla_{\Phi} \mathcal{L}_{\Phi}\big)$;
end while
Algorithm 1 Iterative Optimization Procedure
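To make the alternating scheme concrete, here is a toy numerical sketch of the procedure with made-up scalar quadratic losses (the objectives, learning rate, and numeric outer gradient are illustrative only, not the paper's): the inner step updates theta with phi fixed, and the outer step updates phi through a one-step look-ahead proxy.

```python
# Toy bilevel alternation: illustrative losses, not the paper's objectives.
def grad_inner(theta, phi):
    # d/dtheta of the inner loss (theta - phi)^2
    return 2.0 * (theta - phi)

def outer_proxy(theta, phi, eta):
    # Outer loss evaluated at the one-step-updated theta (the proxy).
    theta_prime = theta - eta * grad_inner(theta, phi)
    return (theta_prime - 1.0) ** 2  # illustrative outer objective

theta, phi, eta, eps = 0.0, 0.5, 0.1, 1e-5
for _ in range(200):
    # Inner update: fix phi, optimize theta.
    theta -= eta * grad_inner(theta, phi)
    # Outer update: fix theta, optimize phi via a numeric gradient of the proxy.
    g = (outer_proxy(theta, phi + eps, eta)
         - outer_proxy(theta, phi - eps, eta)) / (2 * eps)
    phi -= eta * g
```

Because phi only affects the outer loss through the one-step-updated theta, the outer gradient does not collapse the margin the way direct joint minimization would; here both variables converge toward the outer optimum.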

4.2.3. The Design of $f(\cdot)$

We parameterize $f(\cdot)$ with a neural network to generate the margin based on $s_{(u,i,j)}$:

$$ m_{(u,i,j)} = \mathrm{softplus}\big(\mathbf{w}_2^{\top} \tanh(\mathbf{W}_1 s_{(u,i,j)} + \mathbf{b}_1) + b_2\big) \quad (10) $$

Here $\mathbf{W}_1$, $\mathbf{b}_1$, $\mathbf{w}_2$ and $b_2$ are learnable parameters in $\Phi$, $s_{(u,i,j)}$ is the input to generate the margin, and $m_{(u,i,j)}$ is the generated margin of the triplet $(u, i, j)$. The activation function softplus guarantees $m_{(u,i,j)} > 0$. To promote a discriminative $s_{(u,i,j)}$ that reflects the relation between $i$ and $j$, the following form can be a fine-grained indicator:

$$ s_{(u,i,j)} = (\mathbf{h}_u \ominus \mathbf{h}_i)^2 \oplus (\mathbf{h}_u \ominus \mathbf{h}_j)^2 \quad (11) $$

Here the element-wise square $(\cdot)^2$ is introduced to mimic the calculation of the Euclidean distance without summing over all dimensions, $\ominus$ denotes element-wise subtraction, and $\oplus$ denotes the concatenation operation. To improve the robustness of $f(\cdot)$, we take as inputs the sampled embeddings $\mathbf{h}_u$, $\mathbf{h}_i$ and $\mathbf{h}_j$. To perform backpropagation from $\mathbf{h}_u$, $\mathbf{h}_i$ and $\mathbf{h}_j$ to the distribution parameters, we adopt the reparameterization trick (DBLP:journals/corr/KingmaW13) for Eq. 1:

$$ \mathbf{h} = \mu + \Sigma^{1/2} \odot \epsilon \quad (12) $$

where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\odot$ is element-wise multiplication.
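A sketch of the reparameterized sampling and the margin generator follows. The softplus output (guaranteeing a positive margin) and the concatenated squared-difference indicator are from the text; the single tanh hidden layer and all shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    return np.log1p(np.exp(x))

def sample_embedding(mu, var):
    """Reparameterization: h = mu + var**0.5 * eps, eps ~ N(0, I).

    Keeps the sample differentiable w.r.t. mu and var in an autograd
    framework; var holds the diagonal variances as a vector.
    """
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

def adaptive_margin(h_u, h_i, h_j, W1, b1, w2, b2):
    """Margin generator: indicator -> hidden layer -> softplus output."""
    # Element-wise squared differences, concatenated (the indicator).
    s = np.concatenate([(h_u - h_i) ** 2, (h_u - h_j) ** 2])
    hidden = np.tanh(W1 @ s + b1)
    return softplus(w2 @ hidden + b2)  # softplus keeps the margin positive
```

With zero variance the sample collapses to the mean, which is a quick sanity check on the reparameterization.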

4.3. User-User and Item-Item Relations

It is important to model the relationships between pairs of users or pairs of items when developing recommender systems and strategies for doing so effectively have been studied for many years (DBLP:conf/www/SarwarKKR01; DBLP:conf/kdd/KabburNK13; DBLP:reference/sp/NingDK15). For example, item-based collaborative filtering methods use item rating vectors to calculate the similarities between the items. Closely-related users or items may share the same interests or have similar attributes. For a certain user, items similar to the user’s preferred items are potential recommendation candidates.

Despite this intuition, previous distance-based recommendation methods (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18) do not explicitly take the user-user or item-item relationships into consideration. As a result of relying primarily on user-item information, the systems may fail to generate appropriate user-user or item-item distances. To model the relationships between similar users or items, we employ two margin ranking losses with adaptive margins to encourage similar users or items to be mapped closer together in the latent space. Formally, the similarities between users or items are calculated from the user implicit feedback, which can be represented by a binary user-item interaction matrix. We set a threshold on the calculated similarities to identify the sets of similar users and items for a specific user $u$ and item $i$, denoted as $\mathcal{N}_u$ and $\mathcal{N}_i$, respectively. We adopt the following losses for user pairs and item pairs, respectively:


where is a randomly sampled user in Eq. 13 and a randomly sampled item in Eq. 14. denotes the user-user relation and denotes the item-item relation. We use and to update and , respectively, which are the same as in Alg. 1. We denote the indicator in Eq. 11 as , then we generate and following the procedure described by Eq. 11.
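The similar-neighbor sets built from the binary interaction matrix can be sketched as follows (function name and threshold default are our own; rows are users, but the same routine applies to item columns by transposing):

```python
import numpy as np

def similar_sets(interactions, threshold=0.5):
    """Cosine similarity between rows of a binary interaction matrix;
    returns, for each row, the indices of rows at or above the threshold."""
    norms = np.linalg.norm(interactions, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # avoid division by zero
    normalized = interactions / norms
    sim = normalized @ normalized.T
    np.fill_diagonal(sim, 0.0)                   # exclude self-similarity
    return [np.flatnonzero(row >= threshold) for row in sim]
```

Entries below the threshold are the pool from which the randomly sampled negatives $u''$ (or $i''$) are drawn.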

4.4. Model Training

Let us denote the user-item losses of Section 4.2 jointly as $\mathcal{L}_{UI}$, which captures the interactions between users and items. Then we combine the loss functions presented in Section 4.3 to optimize the proposed model:

$$ \mathcal{L} = \mathcal{L}_{UI} + \lambda \, (\mathcal{L}_{UU} + \mathcal{L}_{II}) \quad (15) $$

where $\lambda$ is a regularization parameter. We follow the same training scheme of Section 4.2 to train Eq. 15. To mitigate the curse of dimensionality issue (DBLP:conf/nips/BordesUGWY13) and prevent overfitting, we bound all the user/item embeddings within a unit sphere after each mini-batch of training: $\|\mu\|_2 \le 1$ and $\|\Sigma\|_2 \le 1$. When minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with back-propagation.

Recommendation Phase. In the testing phase, for a certain user $u$, we compute the distance between $u$ and each item in the dataset. The items that are not in the training set and have the shortest distances to $u$ are then recommended.
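The ranking step amounts to a sort and a filter; a small sketch (names are our own, with precomputed user-to-item distances passed in):

```python
import numpy as np

def recommend_top_k(dist_to_items, train_items, k):
    """Rank all items by increasing distance to the user and return the
    k closest items not seen during training."""
    order = np.argsort(dist_to_items)
    return [int(i) for i in order if int(i) not in train_items][:k]
```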

5. Experiments

In this section, we evaluate the proposed model, comparing with the state-of-the-art methods on five real-world datasets.

5.1. Datasets

The proposed model is evaluated on five real-world datasets from various domains with different sparsities: Books, Electronics and CDs (DBLP:conf/www/HeM16), Comics (DBLP:conf/recsys/WanM18) and Gowalla (DBLP:conf/kdd/ChoML11). The Books, Electronics and CDs datasets are adopted from the Amazon review dataset with different categories, i.e., books, electronics and CDs. These datasets include a significant amount of user-item interaction data, e.g., user ratings and reviews. The Comics dataset was collected in late 2017 from the GoodReads website, which covers books of different genres; we use the comics genre. The Gowalla dataset was collected worldwide from the Gowalla website (a location-based social networking website) over the period from February 2009 to October 2010. In order to be consistent with the implicit feedback setting, we retain any ratings no less than four (out of five) as positive feedback and treat all other ratings as missing entries for all datasets. To filter noisy data, we only include users with at least ten ratings and items with at least five ratings. Table 1 shows the data statistics.

We employ five-fold cross-validation to evaluate the proposed model. For each user, the items she accessed are randomly split into five folds. We pick one fold each time as the ground truth for testing, and the remaining four folds constitute the training set. The average results over the five folds are reported.

Dataset #Users #Items #Interactions Density
Books 77,754 66,963 2,517,343 0.048%
Electronics 40,358 28,147 524,906 0.046%
CDs 24,934 24,634 478,048 0.079%
Comics 37,633 39,623 2,504,498 0.168%
Gowalla 64,404 72,871 1,237,869 0.034%
Table 1. The statistics of the datasets.

5.2. Evaluation Metrics

We evaluate all models in terms of Recall@k and NDCG@k. For each user, Recall@k (R@k) indicates the percentage of her rated items that appear in the top-$k$ recommended items. NDCG@k (N@k) is the normalized discounted cumulative gain at $k$, which takes the position of correctly recommended items into account.
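The two metrics follow their standard definitions; a reference sketch (binary relevance assumed):

```python
import numpy as np

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's held-out items that appear in the top-k list."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Position-aware gain: hits near the top of the list count more."""
    relevant = set(relevant)
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    # Ideal DCG: all held-out items ranked first.
    idcg = sum(1.0 / np.log2(rank + 2)
               for rank in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0
```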

5.3. Methods Studied

To demonstrate the effectiveness of our model, we compare to the following recommendation methods.
Classical methods for implicit feedback:


  • BPRMF, Bayesian Personalized Ranking-based Matrix Factorization (DBLP:conf/uai/RendleFGS09), which is a classic method for learning pairwise personalized rankings from user implicit feedback.

Classical neural-based recommendation methods:


  • NCF, Neural Collaborative Filtering (DBLP:conf/www/HeLZNHC17), which combines the matrix factorization (MF) model with a multi-layer perceptron (MLP) to learn the user-item interaction function.

  • DeepAE, the deep autoencoder (DBLP:conf/cikm/MaZWL18), which utilizes a three-hidden-layer autoencoder with a weighted loss function.

State-of-the-art distance-based recommendation methods:


  • CML, Collaborative Metric Learning (DBLP:conf/www/HsiehYCLBE17), which learns a metric space to encode the user-item interactions and to implicitly capture the user-user and item-item similarities.

  • LRML, Latent Relational Metric Learning (DBLP:conf/www/TayTH18), which exploits an attention-based memory-augmented neural architecture to model the relationships between users and items.

  • TransCF, Collaborative Translational Metric Learning (DBLP:conf/icdm/ParkKXY18), which employs the neighborhood of users and items to construct translation vectors capturing the intensity of user–item relations.

  • SML, Symmetric Metric Learning with adaptive margin (DBLP:journals/corr/LiZZQZHH20), which measures the trilateral relationship from both the user- and item-centric perspectives and learns adaptive margins.

The proposed method:


  • PMLAM, the proposed model, which represents each user and item as Gaussian distributions to capture the uncertainties in user preferences and item properties, and incorporates an adaptive margin generation mechanism to generate the margins based on the sampled user-item triplets.

5.4. Experiment Settings

In the experiments, the latent dimension of all the models is set to for a fair comparison. All the models adopt the same negative sampling strategy with the proposed model, unless otherwise specified. For BPRMF, the learning rate is set to and the regularization parameter is set to . With these parameters, the model can achieve good results. For NCF, we follow the same model structure as in the original paper (DBLP:conf/www/HeLZNHC17). The learning rate is set to and the batch size is set to . For DeepAE, we adopt the same model structure employed in the author-provided code and set the batch size to . The weight of the positive items is selected from by a grid search and the weights of all other items are set to as recommended in (DBLP:conf/icdm/HuKV08). For CML, we use the authors’ implementation to set the margin to and the regularization parameter to . For LRML, the learning rate is set to , and the number of memories is selected from by a grid search. For TransCF, we follow the settings in the original paper to select and set the margin to and batch size to , respectively. For SML, we follow the author’s code to set the user and item margin bound to , to and to , respectively.

For our model, both the learning rate and are set to . For the and datasets, we randomly sample unobserved users or items as negative samples for each user and positive item. This number is reduced to for the other datasets to speed up the training process. The batch size is set to for all datasets. The dimension is set to . The user and item embeddings are initialized by drawing each vector element independently from a zero-mean Gaussian distribution with a standard deviation of . Our experiments are conducted with PyTorch running on GPU machines (Nvidia Tesla P100).

Recall@10:
Books 0.0553 0.0568 0.0817 0.0730 0.0565 0.0754 0.0581 0.0885** 8.32%
Electronics 0.0243 0.0277 0.0253 0.0395 0.0299 0.0353 0.0279 0.0469*** 18.73%
CDs 0.0730 0.0759 0.0736 0.0922 0.0822 0.0851 0.0793 0.1129*** 22.45%
Comics 0.1966 0.2092 0.2324 0.1934 0.1795 0.1967 0.1713 0.2417 4.00%
Gowalla 0.0888 0.0895 0.1113 0.0840 0.0935 0.0824 0.0894 0.1331*** 19.58%
NDCG@10:
Books 0.0391 0.0404 0.0590 0.0519 0.0383 0.0542 0.0415 0.0671** 13.72%
Electronics 0.0111 0.0125 0.0134 0.0178 0.0117 0.0148 0.0105 0.0234*** 31.46%
CDs 0.0383 0.0402 0.0411 0.0502 0.0420 0.0461 0.0423 0.0619*** 23.30%
Comics 0.2247 0.2395 0.2595 0.2239 0.1922 0.2341 0.1834 0.2753* 6.08%
Gowalla 0.0806 0.0822 0.0944 0.0611 0.0670 0.0611 0.0823 0.0984* 4.23%
Table 2. The performance comparison of all methods in terms of Recall@10 and NDCG@10. The best performing method is boldfaced; the underlined number is the second best performing method. *, **, and *** indicate statistical significance at , , and , respectively, compared to the best baseline method based on the paired t-test. Improv. denotes the improvement of our model over the best baseline method.

5.5. Implementation Details

To speed up the training process, we implement a two-phase sampling strategy. We sample a number of candidates, e.g., 500, of negative samples for each user every 20 epochs to form a candidate set. During the next 20 epochs, the negative samples of each user are sampled from her candidate set. This strategy can be implemented using multiple processes to further reduce the training time.
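The two-phase strategy above can be sketched as follows; the function names, the candidate-pool size, and the refresh interval of 20 epochs are taken from the text, while everything else (data layout, helper signatures) is our illustrative assumption rather than the paper's actual implementation:

```python
import numpy as np

def build_candidate_sets(num_items, interacted, num_candidates=500, seed=0):
    """Phase 1: every 20 epochs, draw a pool of candidate negatives per user.

    `interacted` maps each user id to the set of item ids she has interacted
    with; candidates are drawn uniformly from the remaining items.
    """
    rng = np.random.default_rng(seed)
    candidates = {}
    for user, pos_items in interacted.items():
        pool = np.setdiff1d(np.arange(num_items), list(pos_items))
        size = min(num_candidates, len(pool))
        candidates[user] = rng.choice(pool, size=size, replace=False)
    return candidates

def sample_negatives(candidates, user, k, rng):
    """Phase 2: during the next 20 epochs, negatives come from the cached pool,
    which is much cheaper than sampling from the full item catalogue."""
    return rng.choice(candidates[user], size=k, replace=True)
```

Because each user's pool is independent, phase 1 parallelizes naturally across worker processes, which is how the strategy further reduces training time.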

Since none of the processed datasets has inherent user-user/item-item information, we treat the user-item interactions as a user-item matrix and compute the cosine similarity for user pairs and item pairs, respectively (DBLP:conf/www/SarwarKKR01). We set a threshold, e.g., on the Amazon and Gowalla datasets and on the Comics dataset, to select the neighbors. These thresholds are chosen to ensure a reasonable degree of connectivity in the constructed graphs.
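A minimal sketch of this neighbor construction, assuming a dense binary interaction matrix (the function name and the dense representation are our simplifications; a sparse matrix would be used in practice):

```python
import numpy as np

def similarity_neighbors(interactions, threshold, axis="user"):
    """Cosine similarity between rows (users) or columns (items) of a binary
    user-item interaction matrix; keep pairs at or above `threshold`."""
    mat = interactions if axis == "user" else interactions.T
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = mat / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # a user/item is not its own neighbor
    return [(i, j) for i in range(sim.shape[0])
            for j in range(i + 1, sim.shape[1]) if sim[i, j] >= threshold]
```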

Figure 3. The performance comparison on all datasets: Recall@k and NDCG@k on Books, Electronics, CDs, Comics, and Gowalla.

5.6. Performance Comparison

The performance comparison is shown in Figure 3 and Table 2. Based on these results, we have several observations.

Observations about our model. First, the proposed model, PMLAM, achieves the best performance on all five datasets with both evaluation metrics, which illustrates the superiority of our model.
Second, PMLAM outperforms SML. Although SML has an adaptive margin mechanism, it is achieved by having a learnable scalar margin for each user and item and adding a regularization term to prevent the learned margins from being too small. It can be challenging to identify an appropriate regularization weight via hyperparameter tuning. By contrast, PMLAM formulates the adaptive margin generation as a bilevel optimization, avoiding the additional regularization. Moreover, PMLAM employs a neural network to generate the adaptive margin, so the number of parameters related to margin generation does not increase with the number of users or items.
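To make the parameter-count argument concrete, a hypothetical sketch of such a margin-generation network is given below. The layer sizes, the concatenated input, and the softplus output are our assumptions for illustration, not the paper's exact architecture (which builds its inputs via Eq. 11):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginGenerator(nn.Module):
    """Hypothetical margin-generation network: maps a (user, positive, negative)
    triplet of embeddings to one positive scalar margin."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, user, pos, neg):
        # One shared parameter set serves every triplet, so the number of
        # margin-related parameters is independent of |users| and |items|.
        x = torch.cat([user, pos, neg], dim=-1)
        return F.softplus(self.net(x)).squeeze(-1)  # keep margins positive
```

The softplus keeps every generated margin strictly positive without the explicit regularization term that SML needs to stop its per-user/per-item margins from collapsing to zero.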

Third, PMLAM achieves better performance than TransCF. One major reason is that TransCF only considers the items rated by a user and the users who rated an item as the neighbors of the user and item, respectively, which neglects the user-user/item-item relations. PMLAM models the user-user/item-item relations by two margin ranking losses with adaptive margins.
Fourth, PMLAM makes better recommendations than CML and LRML. These methods apply a fixed margin for all user-item triplets and do not measure or model the uncertainty of learned user/item embeddings. PMLAM represents each user and item as a Gaussian distribution, where the uncertainties of learned user preferences and item properties are captured by the covariance matrices.
Fifth, PMLAM outperforms NCF and DeepAE. These are MLP-based recommendation methods able to capture non-linear user-item relationships, but they violate the triangle inequality when modeling user-item interactions. As a result, they can struggle to capture fine-grained user preferences for particular items (DBLP:conf/www/HsiehYCLBE17).
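The distance underlying these comparisons, the 2-Wasserstein distance between Gaussians, has a closed form. A minimal sketch, assuming diagonal covariances (the commuting case, where the covariance term reduces to a difference of standard deviations; the diagonal assumption is ours for illustration):

```python
import numpy as np

def w2_diag_gaussians(mu1, var1, mu2, var2):
    """2-Wasserstein distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).

    For diagonal covariances:
        W2^2 = ||mu1 - mu2||^2 + ||sqrt(var1) - sqrt(var2)||^2,
    so the distance obeys the triangle inequality and is well defined even for
    point masses (var = 0), unlike the KL divergence.
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return np.sqrt(mean_term + cov_term)
```

When both variances shrink to zero the expression degenerates to the plain Euclidean distance between the means, i.e., the deterministic metric-learning setting of CML.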

Other observations. First, all of the results reported for the Comics dataset are considerably better than those for the other datasets. The other four datasets are sparser and data sparsity negatively impacts recommendation performance.
Second, CML, LRML and TransCF perform better than SML on most of the datasets. The adaptive margin regularization term in SML struggles to adequately counterbalance SML’s tendency to reduce the loss by imposing small margins. Although SML is reported to outperform CML, LRML and TransCF in (DBLP:journals/corr/LiZZQZHH20), those experiments are conducted on three relatively small-scale datasets with only several thousand users and items. We experiment with much larger datasets, and identifying a successful regularization setting appears to become more difficult as the number of users increases.
Third, TransCF outperforms LRML on most of the datasets. One possible reason is that TransCF has a more effective translation embedding learning mechanism, which incorporates the neighborhood information of users and items. TransCF also has a regularization term to further pull positive items closer to the anchor user.
Fourth, CML achieves better performance than LRML on most of the datasets. CML integrates the weighted approximate-rank pairwise (WARP) weighting scheme (DBLP:journals/ml/WestonBU10) in the loss function to penalize lower-ranked positive items. The comparison between CML and LRML in (DBLP:conf/www/TayTH18) removes this component of CML. The WARP scheme appears to play an important role in improving CML’s performance.
Fifth, DeepAE outperforms NCF. The heuristic weighting function of DeepAE can impose useful penalties to errors that occur during training when positive items are assigned lower prediction scores.

Architecture CDs Electronics
R@10 N@10 R@10 N@10
(1) + Deter_Emb 0.0721 0.0371 0.0241 0.0090
(2) + Gauss_Emb 0.0815 0.0434 0.0296 0.0110
(3) + Deter_Emb 0.0777 0.0415 0.0338 0.0125
(4) - + Deter_Emb 0.0408 0.0204 0.0139 0.0055
(5) - + Deter_Emb 0.0311 0.0158 0.0050 0.0018
(6) + Gauss_Emb 0.0856 0.0454 0.0365 0.0155
(7) + + 0.0966 0.0526 0.0429 0.0189
(8) PMLAM 0.1129 0.0619 0.0469 0.0234
Table 3. The ablation analysis on the CDs and Electronics datasets. cat denotes the concatenation operation and add denotes the addition operation.

5.7. Ablation Analysis

To verify and assess the relative effectiveness of the proposed user-item interaction module, the adaptive margin generation module, and the user-user/item-item relation module, we conduct an ablation study. Table 3 reports the performance improvement achieved by each module of the proposed model. Note that we compute Euclidean distances between deterministic embeddings. In (1), which serves as a baseline, we use the hinge loss with a fixed margin (Eq. 4) on deterministic embeddings of users and items to capture the user-item interaction ( is set to which is commonly used in (DBLP:conf/www/HsiehYCLBE17; DBLP:conf/www/TayTH18; DBLP:conf/icdm/ParkKXY18)). In (2), as an alternative baseline, we apply the same hinge loss as in (1), but replace the deterministic embeddings with parameterized Gaussian distributions (Section 4.1). In (3), we use the adaptive margin generation module (Section 4.2) to generate the margins for deterministic embeddings. In (4), we concatenate the deterministic embeddings of to generate instead of using Eq. 11. In (5), we sum the deterministic embeddings of to generate instead of using Eq. 11. In (6), we combine (2) and (3) to generate the adaptive margins for Gaussian embeddings. In (7), we augment (6) with user-user/item-item modeling but with a fixed margin, where the margin is also set to . In (8), we add the user-user/item-item modeling with adaptive margins (Section 4.3) to replace the fixed margins in the configuration of (7).
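The fixed-margin baseline (1) rests on the standard triplet hinge loss. A minimal sketch of that loss on precomputed user-item distances (the function name and vectorized form are ours; the margin value is illustrative):

```python
import numpy as np

def hinge_triplet_loss(d_pos, d_neg, margin=1.0):
    """Fixed-margin hinge loss on distances: penalize triplets where the
    positive item is not closer to the user than the negative item by at
    least `margin`. Accepts scalars or arrays of distances."""
    return np.maximum(margin + d_pos - d_neg, 0.0)
```

The adaptive variants in (3)-(8) replace the constant `margin` with a per-triplet value produced by the margin-generation module, which is exactly the difference the ablation isolates.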

From the results in Table 3, we have several observations. First, from (1) and (2), we observe that representing each user and item as a Gaussian distribution and computing the distance between distributions improves performance, which suggests that capturing the uncertainties of the learned embeddings is beneficial. Second, from (1) and (3), along with (2) and (6), we observe that incorporating the adaptive margin generation module improves performance, irrespective of whether deterministic or Gaussian embeddings are used. These results demonstrate the effectiveness of the proposed margin generation module. Third, from (3), (4) and (5), we observe that our designed inputs (Eq. 11) for margin generation facilitate the production of appropriate margins compared to the commonly used embedding concatenation or summation operations. Fourth, from (2), (3) and (6), we observe that (6) achieves better results than either (2) or (3), demonstrating that Gaussian embeddings and adaptive margin generation are compatible and can be combined to improve the model performance. Fifth, compared to (6), we observe that the inclusion of the user-user and item-item terms in the objective function (7) leads to a large improvement in recommendation performance. This demonstrates that explicit user-user/item-item modeling is essential and can be an effective supplement to infer user preferences. Sixth, from (7) and (8), we observe that adaptive margins also improve the modeling of the user-user/item-item relations.

User Positive Sampled Movie Margin
405 Scream (Thriller) Four Rooms (Thriller) 1.2752
Toy Story (Animation) 12.8004
French Kiss (Comedy) Addicted to Love (Comedy) 2.6448
Batman (Action) 12.4607
66 Air Force One (Action) GoldenEye (Action) 0.3216
Crumb (Documentary) 5.0010
The Godfather (Crime) The Godfather II (Crime) 0.0067
Terminator (Sci-Fi) 3.6335
Table 4. A case study of the generated margins for sampled training triplets. The movie genre labels are from the MovieLens-100K dataset.

5.8. Case Study

In this section, we conduct case studies to examine whether the adaptive margin generation produces appropriate margins. To this end, we train our model on the MovieLens-100K dataset, which provides rich side information about movies (e.g., movie genres), making it easier to illustrate the results. Since we focus only on the adaptive margin generation, we use deterministic embeddings of users and items to avoid interference from other modules. We randomly sample users from the dataset. For each user, we sample one item that the user has accessed as the positive item and two items the user did not access as negative items, where one has a similar genre to the positive item and the other does not. The results are shown in Table 4.
As shown in Table 4, our adaptive margin generation module tends to generate a smaller margin when the negative movie has a similar genre to the positive movie, and a larger margin when the genres are distinct. The generated margins thus encourage the model to embed items with a higher probability of being preferred closer to the user’s embedding.

6. Conclusion

In this paper, we propose a distance-based model for top-K recommendation. Each user and item in our model is represented by a Gaussian distribution with learnable parameters to handle the uncertainties. By incorporating an adaptive margin scheme, our model generates fine-grained margins for the training triplets during the training procedure. To explicitly capture the user-user/item-item relations, we adopt two margin ranking losses with adaptive margins that force similar user and item pairs to map closer together in the latent space. Experimental results on five real-world datasets validate the performance of our model, demonstrating improved performance compared to many state-of-the-art methods and highlighting the effectiveness of the Gaussian embeddings and the adaptive margin generation scheme. The code is available at