Nowadays, personalized recommendation is ubiquitous in various applications, such as video recommendation in YouTube (Covington et al., 2016), product recommendation in Amazon (Smith and Linden, 2017), and news recommendation in Bing (Wu et al., 2019). The goal of recommendation is to predict whether a user will interact (e.g., click or purchase) with an item and thus help users discover items of potential interest. Collaborative filtering (CF) (Su and Khoshgoftaar, 2009) is a fundamental task in recommendation that leverages the collaborative information among users and items to predict users’ preferences on candidate items. Its simplicity and effectiveness make it one of the most popular techniques in recommender systems.
Generally, the learning process of a CF model can be separated into three major components: the interaction encoder, the loss function, and the negative sampling strategy used when only positive (i.e., implicit) feedback is available. Most existing studies focus on designing more powerful interaction encoders to capture collaborative signals among users and items. In particular, the prevalence of deep learning has motivated a rich line of work that applies various neural networks to CF, including multi-layer perceptrons (MLPs) (He et al., 2017; Covington et al., 2016), auto-encoders (Liang et al., 2018), attention networks (Chen et al., 2017), transformers (Sun et al., 2019), graph neural networks (GNNs) (He et al., 2020), and so on. Nevertheless, these models tend to become increasingly complex in order to show performance improvements, which limits their practical applicability in industrial recommender systems that demand high efficiency.
In contrast, few research efforts have been devoted to investigating the impacts of the latter two components. Specifically, while multiple loss functions have been used in CF, such as Bayesian personalized ranking (BPR) loss (Rendle et al., 2009), binary cross-entropy loss (He et al., 2017), softmax cross-entropy loss (Covington et al., 2016), pairwise hinge loss (Hsieh et al., 2017), and mean square error loss (Chen et al., 2020a), there is still a lack of systematic evaluation and comparison among them. Furthermore, many recent GNN-based studies (Wang et al., 2019; He et al., 2020; Wang et al., 2020; Sun et al., 2020b, a) experiment with the BPR loss (Rendle et al., 2009) and simply set the negative sampling ratio to a small value (i.e., sampling 1 or 10 negative samples per positive user-item pair). In this way, they can justify the superiority of their proposed interaction encoders, but they neglect the importance of loss functions and negative sampling in the learning of CF models.
In fact, we empirically observed that training with the BPR loss and a small negative sampling ratio leads to inferior results for many CF models. In this paper, we show that choosing a suitable loss function and a proper number of negative samples plays an equal or more important role than the interaction encoder. Towards this goal, we systematically compare multiple commonly-used loss functions and also investigate the impact of the negative sampling ratio on each loss function. Moreover, inspired by the widely used contrastive loss (Hadsell et al., 2006; Yang et al., 2019) in computer vision, we propose a cosine contrastive loss (CCL) tailored for CF. Our CCL loss optimizes the embedding by maximizing the cosine similarity of a positive user-item pair, while minimizing the similarity of a negative pair to a certain margin. Surprisingly, we found that even a simple model (e.g., MF), if paired with our proposed CCL loss, is sufficient to surpass many sophisticated state-of-the-art models.
These findings raise questions about whether the current baselines are strong enough to verify the performance improvements of the state-of-the-art CF models, and how much these sophisticated models have really improved. Our work aims to answer these questions. We argue that the current baselines might not be strong enough, which could mislead us to overestimate the real improvements of many new CF models. Instead of criticizing the contributions of any existing work, the main goal of our work is to build a simple and strong baseline model to foster future research on CF.
We build SimpleX as a unified model that integrates matrix factorization and user behavior modeling. Specifically, it comprises a behavior aggregation layer (e.g., average pooling) that obtains a user’s preference vector from the historically interacted items, and then fuses it with the user embedding vector via a weighted sum. More importantly, SimpleX is optimized with our CCL loss and a large negative sampling ratio. Although the interaction encoder of SimpleX seems quite simple and might not be novel at all, we show that it could serve as a very strong baseline model and has great potential for industrial applications because of its high efficiency.
For evaluation, we conduct comprehensive experiments on 11 benchmark datasets in total and compare with a total of 29 popular CF models of different types. The results show that SimpleX outperforms most sophisticated state-of-the-art methods by a large margin (up to 48.5% improvement in NDCG@20 over LightGCN (He et al., 2020) on Amazon-Books). We also empirically compare the performance of six representative loss functions and investigate the impact of different negative sampling ratios on each loss function, which demonstrates the superiority of our proposed CCL loss for CF tasks. Furthermore, we evaluate the efficiency of SimpleX, which shows more than 10x speedup over the simplified GNN-based CF model, LightGCN (He et al., 2020). We hope that our work could not only serve as a simple and strong baseline to foster future research on CF, but also attract more research efforts towards the co-design of interaction encoders, loss functions, and negative sampling strategies.
The main contributions of our work are summarized as follows:
We highlight the importance of loss functions and negative sampling in CF, and propose the cosine contrastive loss accordingly.
We present a simple and strong baseline model, SimpleX, which could even attain much better performance than most sophisticated state-of-the-art models.
We perform experiments on 11 benchmark datasets and compare SimpleX with 29 existing CF models to show its superiority in terms of both effectiveness and efficiency.
2. Background and Related Work
In this section, we first give a formulation of collaborative filtering and point out three important aspects in CF modeling. We then summarize different categories of CF models.
2.1. Formulation of CF
Research on collaborative filtering includes implicit CF and explicit CF. Implicit CF models learn from implicit feedback data, e.g., clicks, visits, and purchases, while explicit CF models learn from explicit feedback such as ratings. In this work, we focus on implicit CF since it is more common in real recommendation scenarios. Besides, it is also easy to transform explicit feedback into implicit feedback via binarization. In implicit CF, a matrix $Y$ is used to denote the user-item interactions, where $y_{ui} = 1$ if user $u$ has an observed interaction with item $i$ and $y_{ui} = 0$ otherwise.
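As a minimal sketch of the binarization step mentioned above, the following builds a dense 0/1 interaction matrix from explicit ratings. The function name `binarize_ratings`, the `threshold` parameter, and the toy ratings are illustrative assumptions and are not taken from the paper; a real dataset would use a sparse representation.

```python
def binarize_ratings(ratings, num_users, num_items, threshold=0.0):
    """Build a dense 0/1 interaction matrix from (user, item, rating) triples.

    `threshold` is a hypothetical cut-off: ratings strictly above it count
    as observed (positive) interactions, everything else stays 0.
    """
    matrix = [[0] * num_items for _ in range(num_users)]
    for user, item, rating in ratings:
        if rating > threshold:
            matrix[user][item] = 1
    return matrix


# Toy explicit feedback: (user id, item id, rating).
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 2, 1.0)]
Y = binarize_ratings(ratings, num_users=3, num_items=3, threshold=2.0)
```

With the threshold of 2.0 chosen here, the rating of 1.0 given by user 2 is treated as unobserved, so the third row of `Y` stays all zeros.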
As mentioned in Section 1, we highlight three vital aspects that have a large impact on the learning process of CF models:
(1) Interaction Encoder. The function of the interaction encoder is to learn embeddings for each user and each item, which capture collaborative signals in the interaction matrix that reflect the behavioral similarity between users (or items). It is undoubtedly the core of CF models and has been well studied. We give a brief summary of interaction encoders in Section 2.2.
(2) Loss Function. In general, there are two common types of loss functions in CF. Pointwise loss functions such as binary cross-entropy (BCE) and mean square error (MSE) treat the learning process as a binary classification or a regression task. Pairwise loss such as Bayesian personalized ranking loss (BPR) is optimized to make the similarities of positive user-item pairs larger than the negative ones.
(3) Negative Sampling. Since there are a lot of unobserved entries, in most cases we need to perform negative sampling to improve training efficiency. A few studies have sought to improve on uniform random sampling for recommendation, including mining informative negative samples (e.g., RNS (Ding et al., 2019) and NBPO (Yu and Qin, 2020b)), tackling the selection bias of implicit user feedback (e.g., MSN (Yang et al., 2020)), and so on. In this work, we mainly investigate the influence of the negative sampling ratio. The existing studies are complementary to our work and could potentially be applied to our SimpleX model for further improvement.
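To make the notion of a negative sampling ratio concrete, the sketch below draws a fixed number of negatives per positive pair by uniform random sampling with replacement. The helper `sample_negatives` and its parameters are hypothetical names introduced for illustration; the paper does not prescribe this exact procedure.

```python
import random


def sample_negatives(interacted, num_items, num_negatives, rng):
    """Uniformly sample item ids that the user has not interacted with.

    Sampling is done with replacement, so the same negative may appear
    more than once. `interacted` is the set of observed item ids.
    """
    if len(interacted) >= num_items:
        raise ValueError("user has interacted with every item")
    negatives = []
    while len(negatives) < num_negatives:
        item = rng.randrange(num_items)
        if item not in interacted:  # reject observed (positive) items
            negatives.append(item)
    return negatives


rng = random.Random(0)  # fixed seed only for reproducibility of the sketch
negs = sample_negatives({1, 4}, num_items=10, num_negatives=5, rng=rng)
```

Here `num_negatives` plays the role of the negative sampling ratio, i.e., the number of negatives drawn for each positive user-item pair.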
2.2. Summary of representative CF methods
We summarize representative CF methods into four categories:
(1) MF-based methods. Matrix factorization (MF) based algorithms decompose the user-item interaction matrix into two low-dimensional latent matrices for user and item representation. Due to its effectiveness, MF has been widely studied in CF. Manotumruksa et al. proposed GRMF (Manotumruksa et al., 2017), which smooths MF by adding a graph Laplacian regularizer to introduce graph information. Yang et al. devised a unified and efficient method called HOP-Rec (Yang et al., 2018) that incorporates both MF and graph-based models for implicit CF. Chen et al. designed ENMF (Chen et al., 2020a), an efficient MF-based CF model with a modified MSE loss function. It can be optimized efficiently without negative sampling for implicit feedback.
(2) Autoencoder-based methods. Autoencoder-based CF methods leverage autoencoder network architectures to learn item embeddings. Such models are suitable for inductive recommendation, i.e., learning from one group of users while performing recommendation for another group of users with the same candidate items. For example, Liang et al. proposed Mult-VAE (Liang et al., 2018), which applies a variational autoencoder (VAE) to CF. Ma et al. proposed MacridVAE (Ma et al., 2019b), which disentangles the user intents behind user-item interactions and leverages β-VAE to simulate the generative process of a user’s personal interaction history. Steck designed a linear model called EASE^R (Steck, 2019) that is geared toward sparse data, in particular implicit feedback data, for recommendation.
(3) GNN-based methods. Since the interaction data can be naturally modelled as a user-item bipartite graph, recent studies propose graph neural network (GNN) based CF models and report state-of-the-art performance. GNN-based methods model recommendation as a link prediction task between user nodes and item nodes, where higher-order collaborative signals can be effectively captured through multi-layer message passing. Ying et al. proposed PinSage (Ying et al., 2018), which improves GraphSage (Hamilton et al., 2017) to model the item-item relationships for Pinterest. Wang et al. devised NGCF (Wang et al., 2019), which explicitly encodes the collaborative signals as high-order connectivities by performing embedding propagation. He et al. proposed LightGCN (He et al., 2020), which removes the feature transformation and non-linear activation in NGCF and improves both performance and efficiency. These successful applications of GNNs in recommendation have inspired many further studies, including BGCF (Sun et al., 2020a), which models the uncertainty in the user-item graph with Bayesian graph neural networks, DGCF (Wang et al., 2020), which models a distribution over intents for each user-item interaction, NIA-GCN (Sun et al., 2020b) and NGAT4Rec (Song et al., 2020), which learn neighborhood relationships, as well as SGL-ED (Wu et al., 2021), DHCF (Ji et al., 2020), LCFN (Yu and Qin, 2020a), and so on.
(4) Others. We put methods that do not fall into the first three categories into this “Others” category. Representative models include SLIM (Ning and Karypis, 2011), a simple linear model that combines the advantages of neighborhood- and model-based CF approaches, the MLP-based NeuMF (He et al., 2017) and YouTubeNet (Covington et al., 2016), the memory network-based CMN (Ebesu et al., 2018), the metric learning-based CML (Hsieh et al., 2017), and NBPO (Yu and Qin, 2020b), which leverages noisy-label robust learning techniques.
In this section, we first present our cosine contrastive loss and the SimpleX model architecture for CF. We then analyze its connections to other existing models.
3.1. Cosine Contrastive Loss
In the CF literature, many different loss functions have been employed, including BPR loss (Rendle et al., 2009), binary cross-entropy (He et al., 2017), softmax cross-entropy (Covington et al., 2016), pairwise hinge loss (Hsieh et al., 2017), etc. However, there is still a lack of a systematic comparison among them, leaving their effects on model performance not well understood. In this work, we not only make such a comparison, but also propose a new loss function for CF, namely cosine contrastive loss (CCL). Given a positive user-item pair ($u$, $i$) and a set of randomly sampled negative samples (i.e., $\mathcal{N}$), the CCL loss is expressed as follows:
$$\mathcal{L}_{CCL}(u, i) = (1 - \hat{y}_{ui}) + \frac{w}{|\mathcal{N}|} \sum_{j \in \mathcal{N}} \max(0, \hat{y}_{uj} - m) \tag{1}$$
where $\hat{y}_{ui}$ calculates the cosine similarity between the representation vectors of user $u$ and item $i$. $|\mathcal{N}|$ denotes the number of negative samples. $m$ is the margin to filter negative samples, which is usually set between 0 and 1. Intuitively, CCL is optimized to maximize the similarity of positive pairs and to push the similarity of negative pairs below the margin $m$. $w$ is a hyper-parameter to control the relative weights of positive-sample loss and negative-sample loss.
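The loss described above can be sketched for a single positive pair as follows. This is a plain-Python illustration under the assumption that the positive term is one minus the cosine similarity and that the negative term is the margin-clipped cosine similarity averaged over the sampled negatives and scaled by the negative weight; the function names and the toy vectors are illustrative, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def ccl_loss(user, pos_item, neg_items, margin, neg_weight):
    """Cosine contrastive loss for one positive pair and its negatives."""
    # Positive part: pull the positive pair towards cosine similarity 1.
    pos_loss = 1.0 - cosine(user, pos_item)
    # Negative part: only negatives whose similarity exceeds the margin count.
    neg_losses = [max(0.0, cosine(user, neg) - margin) for neg in neg_items]
    neg_loss = sum(neg_losses) / len(neg_items)
    return pos_loss + neg_weight * neg_loss


user = [1.0, 0.0]
pos = [1.0, 0.0]          # cosine similarity 1 with the user
negs = [[0.0, 1.0],       # cosine similarity 0: below the margin, filtered out
        [1.0, 1.0]]       # cosine similarity about 0.707: a hard negative
loss = ccl_loss(user, pos, negs, margin=0.5, neg_weight=2.0)
```

In this toy case only the second negative contributes, which illustrates how the margin discards negatives that are already well separated from the user.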
Contrastive loss (Hadsell et al., 2006; Yang et al., 2019) has been widely used in computer vision tasks, such as face recognition and image retrieval. But we make several design choices that differ from most widely-used loss functions in CF and greatly facilitate model training. First, instead of applying the dot product (e.g., in LightGCN (He et al., 2020)) or Euclidean distance (e.g., in CML (Hsieh et al., 2017)) to measure the similarity (or distance) between a user-item pair, we choose to compute the cosine similarity between them. By applying L2 normalization to both representation vectors, cosine similarity only measures the angle difference and thus avoids the effect of representation magnitude. This is favorable since the magnitude of a user/item representation could be strongly biased by its popularity in CF tasks. This is also similar to the calculation of word similarity in Word2Vec (Mikolov et al., 2013), where cosine similarity is usually used.
Second, when the number of negative samples becomes large, there usually exist many redundant yet uninformative samples. But existing loss functions (e.g., BPR (Rendle et al., 2009)) treat every negative sample equally. As such, model training could be overwhelmed by these uninformative samples, which significantly degrades the model performance and also slows convergence. In contrast, CCL alleviates this problem by using a proper margin to filter uninformative negative samples. Intuitively, uninformative negative samples get zero loss in CCL when their cosine similarity is below the margin $m$. As a result, CCL automatically identifies the hard negative samples whose cosine similarity is larger than $m$, and thus facilitates better training of the model.
Third, we found that directly summing or averaging the loss terms of all negative samples could degrade the model performance, especially when the number of negative samples is large. This is partially due to the high imbalance between positive and negative samples (e.g., 1:1000 when $|\mathcal{N}| = 1000$). We thus introduce a data-dependent weight $w$ to control the balance between positive loss and negative loss. We emphasize that it also achieves a similar effect to the confidence weight imposed on negative samples in weighted matrix factorization (Hu et al., 2008).
3.2. Model Architecture
To leverage the advantages of CCL, we further propose a simple CF model, dubbed SimpleX. In the design of SimpleX, we keep simplicity in mind and borrow ideas from several successful models such as YouTubeNet (Covington et al., 2016), ACF (Chen et al., 2017), and PinSage (Ying et al., 2018).
Figure 1 illustrates the overall architecture of SimpleX. It largely follows the mechanism of MF, which factorizes users and items into a common latent space. Yet, SimpleX also takes the interacted item sequence of each user as additional input to better model user behaviors. This has been shown to be effective in many existing studies, such as YouTubeNet (Covington et al., 2016) and ACF (Chen et al., 2017). The key part of SimpleX lies in its aggregation layer for behavior sequence aggregation. Here we introduce three common aggregation choices, namely average pooling, self-attention, and user-attention, but SimpleX is a unified architecture to which any other aggregation method should also be applicable.
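Denote the historically interacted item set of user $u$ as $\mathcal{H}_u$, and set its maximal size to $K$. For users with a different number of interacted items, either padding or chunking can be applied accordingly. The aggregated vector $p_u$ can then be obtained as follows:
$$p_u = \sum_{k \in \mathcal{H}_u} I_k \cdot \alpha_k \cdot e_k \tag{2}$$
where $e_k$ is the $d$-dimensional embedding vector of item $k$. $I_k$ denotes the mask index used during padding, where $I_k = 0$ indicates a padding token; otherwise $I_k = 1$. $\alpha_k$ denotes the aggregation weight, which can be computed according to different aggregation types as follows:
$$\alpha_k = \begin{cases} \dfrac{1}{\sum_{k' \in \mathcal{H}_u} I_{k'}} & \text{average pooling} \\[2ex] \dfrac{\exp(q_k)}{\sum_{k' \in \mathcal{H}_u} I_{k'} \exp(q_{k'})} & \text{self-attention or user-attention} \end{cases} \tag{3}$$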
Average pooling provides a straightforward way to aggregate the interacted items, which has been successfully applied in YouTubeNet (Covington et al., 2016). But it treats each item equally and fails to account for the relative importance of different items as well as a user’s preference on each item. The attention mechanism, such as self-attention and user-attention, can be applied in such cases, as calculated in the lower part of Equation 3. The difference between them lies in the computation of $q_k$, which is:
$$q_k = \begin{cases} q^{\top} \tanh(W_1 e_k + b_1) & \text{self-attention} \\ e_u^{\top} \tanh(W_2 e_k + b_2) & \text{user-attention} \end{cases} \tag{4}$$
where $q$ is a learnable global query vector for self-attention and $e_u$ is the user-specific query vector for user $u$ in user-attention. $W_1, W_2$ and $b_1, b_2$ are learnable parameters. Note that similar attention mechanisms can be found in some existing work (Chen et al., 2017; Wu et al., 2019).
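The masked aggregation described above can be sketched as follows. The attention scores are passed in as plain numbers, so the sketch does not depend on how the scores themselves are computed; all function names and the toy history are illustrative assumptions. Each returned weight already folds in the padding mask, and at least one non-padding position is assumed.

```python
import math


def average_weights(mask):
    """Equal weight for every real (non-padding) position, zero for padding."""
    count = sum(mask)
    return [m / count for m in mask]


def attention_weights(scores, mask):
    """Softmax over attention scores, restricted to non-padding positions."""
    # Subtract the largest unmasked score for numerical stability.
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [m * math.exp(s - top) for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(item_vectors, weights):
    """Weighted sum of the item embedding vectors."""
    dim = len(item_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, item_vectors))
            for d in range(dim)]


# Two real items followed by one padding slot.
history = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
mask = [1, 1, 0]
avg = aggregate(history, average_weights(mask))
# The large score on the padding slot is ignored because of the mask.
att = aggregate(history, attention_weights([0.0, 0.0, 5.0], mask))
```

With equal scores on the two real items, the attention variant reduces to average pooling, which matches the intuition that average pooling is the uniform special case.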
However, after behavior aggregation via Equation 2, the pooling vector $p_u$ may lie in a different latent space from the user vector $e_u$. We further fuse both parts to get the final user representation $h_u$:
$$h_u = g \, e_u + (1 - g) \, V p_u \tag{5}$$
where $V$ is a learnable parameter and $g$ is a hyperparameter weight. Finally, we measure the cosine similarity $\hat{y}_{ui}$ between user $h_u$ and item $e_i$ as the input to our CCL loss.
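The fusion step can be sketched as below, assuming the weighted sum takes the form g times the user embedding plus (1 - g) times a linearly projected pooling vector, which is consistent with the remark in Section 3.3 that g = 1 recovers plain MF. The helper names and the identity projection are illustrative assumptions; in training the projection matrix would be learned.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def fuse_user(user_emb, pooled, projection, g):
    """Weighted sum of the user embedding and the projected pooling vector."""
    projected = mat_vec(projection, pooled)
    return [g * e + (1.0 - g) * p for e, p in zip(user_emb, projected)]


user_emb = [1.0, 0.0]
pooled = [0.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for the learnable projection

h_mf = fuse_user(user_emb, pooled, identity, g=1.0)   # user embedding only
h_mix = fuse_user(user_emb, pooled, identity, g=0.5)  # equal mix of both parts
```

Setting `g=1.0` ignores the behavior aggregation entirely, while `g=0.0` would rely on the aggregated history alone.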
The above three aggregation layers provide different views for aggregation: a global-average view, a global-weighted view, and a user-specific weighted view. The choice among them is quite data-dependent. In our experiments, we show that average pooling is a robust aggregation method that is always worth trying first when applying SimpleX. The other two usually need more tuning effort and in some cases bring only marginal improvements.
3.3. Relationships to Existing Models
SimpleX is also related to multiple popular CF models.
MF. MF is the most common model for CF. SimpleX follows a mechanism similar to that of MF. When setting g = 1 in SimpleX, it reduces to an MF model trained with CCL (i.e., MF-CCL).
YouTubeNet. YouTubeNet is a successful model that has been widely used in industry. SimpleX can also be seen as a simplified YouTubeNet model (without using side features) when average pooling is employed. The only difference is that YouTubeNet employs concatenation instead of a weighted sum to fuse $e_u$ and $p_u$. But the latter performs better in our experiments.
GNN-based models. SimpleX is also similar to GNN-based models. For instance, when choosing the user-attention aggregation layer, it is almost equivalent to a graph attention (GAT) layer applied on user nodes only. If using the self-attention aggregation layer, it works like the neighbor interaction in NIA-GCN (Sun et al., 2020b) as well.
We emphasize that although the design of SimpleX is simple and might not be novel to some extent, it unifies several key components in existing CF models. Surprisingly, such a simple model is sufficient to surpass most state-of-the-art CF models by a large margin, and could serve as a simple and strong baseline for future research.
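|Loss function||Amazon-Books Recall@20||Amazon-Books NDCG@20||Yelp2018 Recall@20||Yelp2018 NDCG@20||Gowalla Recall@20||Gowalla NDCG@20|
|Pairwise Hinge Loss||0.0352||0.0267||0.0562||0.0453||0.1318||0.0996|
|Mean Square Error||0.0337||0.0267||0.0624||0.0513||0.1528||0.1315|
|Cosine Contrastive Loss||0.0559||0.0447||0.0698||0.0572||0.1837||0.1493|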
In this section, we conduct comprehensive experiments to evaluate SimpleX, including: 1) studying the impacts of loss functions and negative sampling ratios, 2) making performance comparisons to existing models on three main datasets, 3) incorporating CCL to other models, 4) performing parameter analysis and efficiency evaluation, 5) further validating SimpleX on some other datasets.
4.1. Experimental Setup
We use 11 benchmark datasets in our study. For fairness and ease of comparison, we choose those open datasets that have been already split and preprocessed. Specifically:
(1) We employ three main datasets Amazon-Books, Yelp2018, and Gowalla, which are commonly used in recent GNN-based CF models (Wang et al., 2019; Chen et al., 2020b; He et al., 2020; Wang et al., 2020; Song et al., 2020; Wu et al., 2021). We perform most of our experiments on them and further make comparisons to these GNN-based models.
(2) To demonstrate the universality of SimpleX, we further test SimpleX on some other datasets adopted by studies published in top-tier conferences. Three of them, Amazon-CDs, Amazon-Movies, and Amazon-Beauty, are adopted by NIA-GCN (Sun et al., 2020b) and BGCF (Sun et al., 2020a). The other three, Amazon-Electronics, CiteUlike-A, and Movielens-1M, are provided by NBPO (Yu and Qin, 2020b), DHCF (Ji et al., 2020), and LCFN (Yu and Qin, 2020a), respectively. Specifically, we compare SimpleX with each of these models on the datasets adopted in its original paper. For example, we compare with DHCF (Ji et al., 2020) on the CiteUlike-A dataset because DHCF adopts this dataset in its original paper.
(3) The last two are Movielens-20M and MillionSongData, which are commonly used by autoencoder-based CF models, such as Mult-VAE (Liang et al., 2018) and RecVAE (Shenbin et al., 2020). We follow the strong generalization setting, which splits the train/validation/test sets by disjoint sets of users, and specifically compare with those autoencoder-based CF models to further demonstrate the effectiveness of SimpleX.
4.1.2. Compared Methods
We compare SimpleX with 29 existing CF models of different types:
Fourteen GNN-based methods, including GC-MC (Berg et al., 2018), PinSage (Ying et al., 2018), GAT (Veličković et al., 2018), NGCF (Wang et al., 2019), DisenGCN (Ma et al., 2019a), LR-GCCF (Chen et al., 2020b), NIA-GCN (Sun et al., 2020b), LightGCN (He et al., 2020), DGCF (Wang et al., 2020), NGAT4Rec (Song et al., 2020), SGL-ED (Wu et al., 2021), BGCF (Sun et al., 2020a), DHCF (Ji et al., 2020), and LCFN (Yu and Qin, 2020a);
4.1.3. Implementation Details
We implement SimpleX in PyTorch. Specifically, we set the batch size to 1024 by default. We use the Adam optimizer and tune the learning rate among [1e-3, 5e-4, 1e-4]. We also employ $L_2$ regularization on the embedding parameters and search the regularization weight between 1e-9 and 1e-2 with an increase ratio of 5. For cosine contrastive loss, we search the number of negative samples from 1 to 2000. In many cases, we pick 100, 500, or 1000. The margin $m$ is tuned between 0 and 1 at an interval of 0.1; for example, we set it to 0.4, 0.9, and 0.9 on Amazon-Books, Yelp2018, and Gowalla, respectively. Meanwhile, we use the same embedding size as the compared model, for example, 64 in LightGCN and 128 in LCFN. For fairness of comparison with existing models, we report the results using the same evaluation metrics (e.g., Recall@20 and NDCG@20) and reuse the results reported in their papers for consistency. To facilitate reproducible research in the community, we have contributed our source code and detailed benchmark settings to the public Open-CF-Benchmark (https://openbenchmark.github.io/collaborative-filtering).
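For readers unfamiliar with the two evaluation metrics, the sketch below computes Recall@K and NDCG@K with binary relevance for a single user. It uses one common definition (recall normalized by the number of held-out items, and a log2 position discount for NDCG); the exact variant used by the benchmark is not stated in this section, so treat the details as assumptions.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k ranking."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg


ranked = [3, 7, 1, 9]     # item ids sorted by predicted score
relevant = {7, 9, 5}      # held-out test items of the user
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

In this toy ranking only item 7 falls within the top 3, so the recall is one third, and NDCG additionally discounts that hit because it sits at the second position.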
4.2. Impact of Different Loss Functions
While most studies focus on the interaction encoder design, they neglect the importance of loss functions in the learning of a CF model. We make a systematic comparison on the impacts of different loss functions. For this purpose, we choose one of the simplest baseline CF models, i.e., MF, as the backbone to perform the experiments, since simple models tend to be more illustrative. In addition to our CCL loss, we evaluate MF on the following representative loss functions:
Pairwise hinge loss (PHL), also known as the max-margin objective, has been used in CML (Hsieh et al., 2017). PHL forces the distance of a negative user-item pair to be larger than that of a positive one by at least the marginal distance.
Binary cross-entropy (BCE) loss is commonly used for binary classification, which has been adopted in the early work NeuMF (He et al., 2017).
Softmax cross-entropy (SCE) loss is widely used for multi-class classification. YouTubeNet (Covington et al., 2016) cast item prediction as a multi-class classification task through the SCE loss.
Table 1 shows the results of training MF with different loss functions on Amazon-Books, Yelp2018, and Gowalla. Note that every model has been trained for enough epochs to reach convergence and the best results are reported. From the results, we have the following observations:
1) CCL consistently achieves the best performance on all the three datasets, outperforming the other loss functions by at least 16.7%, 9.2% and 13.7% w.r.t. Recall@20 on Amazon-Books, Yelp2018 and Gowalla, respectively.
2) BPR only appears to be strong on Gowalla and does not perform well on either Amazon-Books or Yelp2018. This demonstrates that using BPR for training is probably sub-optimal, and thus the results reported by many previous papers may need careful re-examination and are likely to be further improved with our CCL loss.
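Relative improvements of the kind quoted above can be reproduced with a one-line formula, (new - baseline) / baseline. The sketch below applies it to the rows of the loss comparison that are available in this section, under the assumption that the first numeric column is Recall@20 on Amazon-Books. Because these rows are not the strongest competitors, the resulting gaps are larger than the minimum margins quoted in observation 1.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# MF on Amazon-Books, assumed to be the Recall@20 column.
ccl_recall = 0.0559   # cosine contrastive loss
mse_recall = 0.0337   # mean square error
phl_recall = 0.0352   # pairwise hinge loss

ri_over_mse = relative_improvement(ccl_recall, mse_recall)
ri_over_phl = relative_improvement(ccl_recall, phl_recall)
```

Under these assumptions CCL improves over MSE by roughly 66% and over PHL by roughly 59% on this column.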
Why does CCL perform better than the other loss functions? In addition to the design choices analyzed in Section 3.1, we further highlight the advantages of CCL with some concrete comparisons. First, in contrast to BPR, BCE, SCE, and MSE, CCL can automatically single out the hard negative samples that the model finds difficult to distinguish (i.e., those with large cosine similarity) via its margin mechanism: for a given margin, only negative pairs whose cosine similarity exceeds the margin contribute to the loss. Whereas the above loss functions treat each negative sample equally, CCL lets the model focus on learning from hard negative samples and thus generate more discriminative representations. Second, compared with PHL, which also applies a margin mechanism, CCL is more effective for CF. The PHL loss is determined by the relative distance between positive and negative samples. Even if a negative sample is actually hard to distinguish (i.e., it lies close to the user), it will not contribute to learning if the corresponding positive sample is already closer to the user by at least the margin. CCL avoids such ambiguity by penalizing the absolute similarity of each negative sample.
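The contrast between a relative and an absolute margin can be illustrated numerically. The sketch below writes the pairwise hinge loss on similarity scores rather than on distances, which is a simplifying assumption made here for comparability (CML itself is defined on distances); the similarity values and margins are hypothetical.

```python
def pairwise_hinge(pos_sim, neg_sim, margin):
    """Relative margin: zero once the positive beats the negative by `margin`."""
    return max(0.0, margin + neg_sim - pos_sim)


def ccl_negative_term(neg_sim, margin):
    """Absolute margin: zero only if the negative itself is below `margin`."""
    return max(0.0, neg_sim - margin)


pos_sim = 0.95   # the positive item is very close to the user
neg_sim = 0.80   # the negative item is also close, i.e. a hard negative

phl = pairwise_hinge(pos_sim, neg_sim, margin=0.1)
ccl = ccl_negative_term(neg_sim, margin=0.5)
```

The hinge term vanishes because the positive is already ahead by more than its margin, whereas the absolute-margin term still penalizes the negative for being too similar to the user.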
4.3. Impact of Negative Sampling Ratio
We argue that the negative sampling ratio is also important in the learning of CF models, which has been largely ignored by existing studies. To support our claims, we compare the performance of MF trained with 1 to 2000 negative samples on Amazon-Books. We also repeat the experiment with different loss functions. We train each model until convergence and report the best results, as shown in Figure 2. We have the following observations from the results:
1) The number of negative samples does matter for CF model training. Generally, increasing it within a certain range leads to improvements. This suggests that we should carefully consider the impact of the number of negative samples in the evaluation.
2) MF trained with CCL is consistently better than training with the other loss functions under different negative sampling ratios, further demonstrating the superiority of our CCL.
3) The performance of PHL, MSE, and BPR becomes stable once the number of negative samples increases to 50. In contrast, CCL, BCE, and SCE keep gaining performance as the number of negative samples increases, even when it reaches 1000.
In summary, our experimental results show that both loss functions and negative sampling ratios can have a large impact on model performance. Training with the CCL loss and a large negative sampling ratio appears to be a promising setting for CF methods to gain higher performance. We therefore call for more future research towards this direction.
|Amazon-Books||Yelp2018||Gowalla||Avg RI over NGCF|
|RI over NGCF||69.6%||77.9%||21.1%||20.6%||19.2%||17.3%|
|RI over LightGCN||41.9%||48.5%||8.0%||8.5%||2.3%||0.2%|
4.4. Performance Comparison to SOTA Models
In this section, we provide a comprehensive comparison of SimpleX with 23 other CF models on the three main datasets, i.e., Amazon-Books, Yelp2018, and Gowalla, which are very commonly adopted in CF studies (especially in GNN-based CF), to demonstrate the superiority of SimpleX. Table 2 shows the performance comparisons under the same evaluation protocol, and we have the following observations:
1) SimpleX achieves the best overall performance on all three datasets. In particular, compared with the recent LightGCN, SimpleX improves Recall@20 by 41.9%, 8.0%, and 2.3% on Amazon-Books, Yelp2018, and Gowalla, respectively, demonstrating its high effectiveness. Note that we do not report the results of SGL-ED (Wu et al., 2021) and NGAT4Rec (Song et al., 2020) on Gowalla, since their original papers evaluate only on the other two datasets and the authors have not released their code. As the experimental settings of SGL-ED and NGAT4Rec are exactly the same as ours, we directly report their results on Amazon-Books and Yelp2018.
2) The performance of MF-CCL is surprising. When using CCL as the loss function, the performance of MF is not only much better than the results of MF-BPR reported in the previous paper, but also reaches a new state-of-the-art performance (if leaving out our SimpleX) on Amazon-Books and Yelp2018. On Gowalla, it also achieves comparable performance compared to the previous best model DGCF. Such results strongly suggest that loss functions can make a big difference and should be carefully chosen and studied.
3) YouTubeNet, CML, and SLIM are three models that we added; they have not been tested on these three datasets by existing work. We found that they achieve quite good performance. Specifically, on average these three models outperform a representative GNN-based CF model, NGCF, by more than 24% and 28% w.r.t. Recall@20 and NDCG@20, respectively. This implies that the current baselines are relatively weak, which may lead us to overestimate how much real progress we have made in CF.
4) In CF tasks, more complex models do not always lead to better performance. The designs of SLIM, YouTubeNet, CML, MF-CCL, and our SimpleX are all much more concise than most autoencoder-based (e.g., Mult-VAE and MacridVAE) and GNN-based models (e.g., NGCF, NIA-GCN, and DGCF), yet they achieve better performance. This also reveals that the current trend in CF research, which pays too much attention to the design of sophisticated interaction encoders while ignoring the impacts of loss functions and negative sampling, needs to be reconsidered.
4.5. Incorporating CCL to Other Models
In Table 2, we have shown that one of the simplest models, i.e., MF, can even largely outperform most state-of-the-art models when trained with CCL. We are curious about how other models perform when trained with CCL instead of their original losses. Therefore, in this part, we additionally experiment with two effective CF models, i.e., YouTubeNet and LightGCN, trained with CCL, and report the results on Amazon-Books and Yelp2018 in Table 3.
From the results, we find that training YouTubeNet and LightGCN with CCL instead of their original loss functions, i.e., SCE and BPR, respectively, brings clear improvements. This suggests that CCL is a promising loss function for helping CF models achieve better performance. Besides, we observe that the improvements brought by CCL on YouTubeNet and LightGCN are not as significant as those on MF; CCL seems to bring these models to a similar level of performance. A possible reason is that valuable collaborative information can be captured by both the interaction encoder and the loss function. As the encoders of YouTubeNet and LightGCN are more sophisticated and better at learning biased collaborative signals, the impact of the loss function on them appears relatively small.
In addition, it is worth noting that our main focus is to question the value of sophisticated encoders and to provide a simple yet strong baseline, not to improve current state-of-the-art CF models by exhaustively trying various loss functions. Based on the experiments with MF, YouTubeNet, and LightGCN, we demonstrate and highlight that the loss function is a major bottleneck in CF models. We expect our work to inspire more research on the co-design of the interaction encoder, loss function, and negative sampling.
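To make the idea of swapping in CCL concrete, here is a minimal sketch of a cosine contrastive loss for one positive user-item pair and its sampled negatives. The definition of CCL is given earlier in the paper and is not reproduced in this section, so the form below (positive term `1 - cos`, negative term averaged over the sampled negatives and clipped at a margin) is an assumption based on the usual shape of contrastive losses. The names `margin` and `neg_weight` and their default values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_contrastive_loss(user, pos_item, neg_items, margin=0.5, neg_weight=1.0):
    """Assumed CCL form for one positive pair and a list of sampled negatives."""
    # Pull the positive item toward the user: the loss is 0 when cos = 1.
    pos_loss = 1.0 - cosine(user, pos_item)
    # Push away only those negatives whose similarity exceeds the margin.
    neg_loss = sum(max(0.0, cosine(user, n) - margin) for n in neg_items) / len(neg_items)
    return pos_loss + neg_weight * neg_loss

# Toy embeddings (assumed data).
user = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [1.0, 0.0]]
print(cosine_contrastive_loss(user, positive, negatives, margin=0.5, neg_weight=2.0))
```

Because the loss only needs user and item representations, the same function could in principle be placed on top of any encoder that outputs such vectors, which is the setting explored in this subsection.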
4.6. Parameter Analysis on SimpleX
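Table 4 (excerpt): effect of the fusing weight g.
| Setting | Amazon-Books Recall@20 | Amazon-Books NDCG@20 | Yelp2018 Recall@20 | Yelp2018 NDCG@20 |
|---|---|---|---|---|
| g = 0 | 0.0534 | 0.0429 | 0.0679 | 0.0555 |
| g = 0.5 | 0.0583 | 0.0468 | 0.0688 | 0.0565 |
| g = 1 | 0.0540 | 0.0432 | 0.0701 | 0.0575 |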
We investigate the performance of three different behavior aggregation layers, the fusing weight g, and the negative loss weight. Results on Amazon-Books and Yelp2018 are shown in Table 4. We make the following observations: 1) Average pooling, self-attention, and user-attention obtain very similar results on each of Amazon-Books and Yelp2018. This shows the robustness of applying average pooling for behavior aggregation in practice. SimpleX with g = 0.5 reaches higher performance than the other two settings on Amazon-Books, which shows the importance of fusing the user embedding with user behavior aggregation. 2) The negative weight, which adjusts the ratio of positive and negative losses, is vital to the model's performance. In general, too small or too large a difference between positive and negative losses leads to a drop in performance.
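The role of the fusing weight can be illustrated with a small sketch that average-pools the embeddings of a user's historically interacted items and blends the result with the user's own embedding. The convention that g multiplies the user embedding and (1 - g) multiplies the pooled behavior vector is an assumption here, as are the function names. Any learned projection applied to the pooled vector is omitted.

```python
def average_pool(item_embeddings):
    """Element-wise mean of a non-empty list of equal-length embeddings."""
    dim = len(item_embeddings[0])
    count = len(item_embeddings)
    return [sum(e[d] for e in item_embeddings) / count for d in range(dim)]

def fuse_user_representation(user_embedding, behavior_item_embeddings, g=0.5):
    """Blend the user embedding with the pooled behavior vector.

    Assumed convention: g = 1 keeps only the user embedding,
    g = 0 keeps only the aggregated behavior vector.
    """
    pooled = average_pool(behavior_item_embeddings)
    return [g * u + (1.0 - g) * p for u, p in zip(user_embedding, pooled)]

# Toy example (assumed data): a 2-dimensional user with two history items.
user = [1.0, 0.0]
history = [[0.0, 2.0], [0.0, 4.0]]
print(fuse_user_representation(user, history, g=0.5))
```

Under this convention, the three settings g = 0, g = 0.5, and g = 1 correspond to using only the behavior aggregation, an equal blend, and only the user embedding, respectively.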
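Table 6 (excerpt): relative improvement (RI) of SimpleX over the compared models.
| RI over NIA-GCN | 18.6% | 22.9% | RI over NIA-GCN | 26.8% | 35.5% | – | – | – |
| RI over BGCF | 17.1% | 20.8% | RI over BGCF | 25.9% | 33.6% | RI over BGCF | 12.2% | 12.8% |
| RI over NBPO | 8.0% | 4.0% | RI over DHCF | 18.7% | 8.2% | RI over LCFN | 2.0% | 2.6% |

Table 7 (excerpt): relative improvement (RI) of SimpleX over the compared models.
| RI over Mult-VAE | 3.8% | 1.7% | 2.3% | 6.5% | 3.2% | 4.7% |
| RI over EASER | 7.8% | 7.4% | 6.2% | 44.0% | 40.3% | 43.3% |
| RI over RecVAE | 1.1% | 0.6% | 1.1% | 5.8% | 3.2% | 4.8% |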
4.7. Efficiency Comparison
Our SimpleX has high efficiency due to its simple design. We numerically compare the training time of SimpleX with two state-of-the-art CF models, i.e., ENMF and LightGCN, which are relatively efficient in their respective categories, on Amazon-Books. The efficiency experiments are conducted on the same Intel(R) Xeon(R) Silver 4210 CPU @2.20GHz machine with one GeForce RTX 2080 GPU. We compare them under the same implementation framework, using the same acceleration methods (e.g., implementing the sampling with C++) to ensure fairness. Specifically, we present the averaged training time per epoch, the number of epochs that the model needs to reach the level of performance reported in the original paper, and the total training time (test time is not included), in Table 5.
It turns out that SimpleX is much more efficient than ENMF and LightGCN overall. Specifically, SimpleX needs only around 30 epochs to converge in training, which is more convenient for real applications. With a 1000:1 negative sampling ratio, the total training time of SimpleX is around 2x and 10x shorter than that of ENMF and LightGCN, respectively. Moreover, if we decrease the negative sampling ratio to 100:1, the training time for one epoch of SimpleX is reduced to 40s, resulting in a total training time of only 19 minutes. The performance drops slightly compared with using a 1000:1 negative sampling ratio, but it remains at a good level (much better than ENMF and LightGCN). Such high efficiency makes our model promising for large-scale real-world recommender systems.
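The negative sampling ratio discussed above controls how many non-interacted items are drawn for each positive user-item pair, so per-epoch cost grows with it. The following is a plain Python sketch of uniform negative sampling. The experiments use a C++ sampler whose details are not given here, so the uniform distribution, sampling with replacement, and the function name are all assumptions.

```python
import random

def sample_negatives(positive_items, num_items, num_negatives, rng):
    """Draw item ids uniformly at random (with replacement), skipping positives."""
    positives = set(positive_items)
    if len(positives) >= num_items:
        raise ValueError("no negative candidates left for this user")
    negatives = []
    while len(negatives) < num_negatives:
        candidate = rng.randrange(num_items)
        # Rejection step: only keep items the user has not interacted with.
        if candidate not in positives:
            negatives.append(candidate)
    return negatives

# Toy example (assumed data): 100 negatives for one positive pair (a 100:1 ratio).
rng = random.Random(0)
negs = sample_negatives(positive_items=[3, 7], num_items=5000, num_negatives=100, rng=rng)
print(len(negs))
```

As a consistency check on the reported numbers, 19 minutes at 40s per epoch corresponds to about 28 epochs, in line with the roughly 30 epochs needed for convergence.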
4.8. Evaluating SimpleX on More Datasets
In addition to the three main datasets used in the above sub-sections, we evaluate SimpleX on 8 more datasets to further demonstrate its generalizability.
Table 6 compares SimpleX with some state-of-the-art CF models published in 2020. For fairness of comparison, we use the same data preprocessing and experimental settings (embedding dimensions and evaluation metrics) provided by the corresponding papers. We observe that SimpleX consistently outperforms all the compared models on the different datasets. The performance improvements are especially large (12.8% to 33.6% improvement in NDCG@20) on Amazon-CDs, Amazon-Movies and Amazon-Beauty compared to BGCF, a recent GNN-based model. This again strongly supports the effectiveness and robustness of SimpleX as a strong baseline for future work.
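The improvement percentages quoted here can be read as relative improvements over a baseline score. A minimal sketch of that arithmetic follows, assuming the usual definition (difference divided by the baseline score). The function name is hypothetical, and the two scores in the example are taken from the g = 0.5 and g = 0 rows of the parameter analysis in Section 4.6 purely to illustrate the calculation.

```python
def relative_improvement(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0

# Scores 0.0583 and 0.0534 come from the g = 0.5 and g = 0 rows in Section 4.6.
print(round(relative_improvement(0.0583, 0.0534), 1))
```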
Table 7 presents the evaluation results on Movielens-20M and MillionSongData. We can see that SimpleX obtains better performance than SLIM, which is a well-known strong baseline for CF, but it does not surpass Mult-VAE, EASER and RecVAE in their complete forms. This is reasonable because all of them use many more parameters than SimpleX, as shown in the “#Params” columns. Note that both Mult-VAE and RecVAE use 600 as the dimension of the first hidden layer. As the number of items easily reaches millions to billions in industrial recommender systems, we choose a small embedding dimension (i.e., 64), which keeps the number of parameters small. To make the comparison fairer, we reduce the embedding dimensions of the baseline models accordingly. Specifically, for Mult-VAE and RecVAE, we set the encoder and the decoder each to a single dense layer of the reduced dimension. For EASER, we decompose its item similarity matrix (denoted as B) into two low-dimensional sub-matrices by truncated SVD, and multiply the two sub-matrices to approximate the item similarity matrix when making predictions. In this setting, SimpleX clearly outperforms these autoencoder-based CF models.
Overall, our comprehensive experimental results on various datasets show that SimpleX is simple and strong enough to serve as a new baseline model to facilitate future research on CF. The availability of this baseline would allow for more solid experimental evaluations and fairer comparisons among CF models.
In this paper, we study the progress made in CF research and identify three key aspects of CF modeling. While most research focuses on interaction encoders, the impacts of loss functions and negative sampling on CF models have been largely neglected. In this work, we highlight their impacts and further propose the cosine contrastive loss together with a simple and strong baseline for CF, dubbed SimpleX. It outperforms most state-of-the-art CF models by a large margin. We conduct extensive experiments to validate the effectiveness and efficiency of SimpleX, and we release the baseline model and the full benchmarking results to foster future research on CF. We suggest that the CF community pay more attention to other key components in addition to interaction encoders, and we encourage researchers to conduct more robust empirical evaluation.
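The low-rank approximation of the EASER item similarity matrix described above can be written out as follows. This is a sketch under stated assumptions: $k$ denotes the reduced dimension, $|I|$ the number of items, and $x_u$ the interaction row of user $u$. These symbols, and the choice of splitting the singular values evenly between the two factors, are introduced here for illustration and are not taken from the paper.

```latex
% Truncated SVD keeps the k largest singular values of B.
B \;\approx\; U_k \Sigma_k V_k^{\top} \;=\; P\,Q,
\qquad
P = U_k \Sigma_k^{1/2} \in \mathbb{R}^{|I| \times k},
\quad
Q = \Sigma_k^{1/2} V_k^{\top} \in \mathbb{R}^{k \times |I|}

% Prediction with the factorized form, and the resulting parameter count.
\hat{y}_u = (x_u P)\, Q,
\qquad
\#\text{params} = 2k\,|I| \quad \text{instead of} \quad |I|^2
```

With a small $k$, the parameter count grows linearly rather than quadratically in the number of items, which is what makes the comparison with a small-embedding model more balanced.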
This work was supported in part by the National Natural Science Foundation of China (61972219), the Research and Development Program of Shenzhen (JCYJ20190813174403598, SGDX20190918101201-696), the National Key Research and Development Program of China (2018YFB1800601), and the Overseas Research Cooperation Fund of Tsinghua Shenzhen International Graduate School (HW2021013).
- Graph convolutional matrix completion. In KDD’18 Deep Learning Day.
- Efficient neural matrix factorization without sampling for recommendation. ACM Transactions on Information Systems (TOIS) 38 (2), pp. 1–28.
- Attentive collaborative filtering: multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 335–344.
- Revisiting graph based collaborative filtering: a linear residual graph convolutional network approach. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 27–34.
- Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on Recommender Systems (RecSys), pp. 191–198.
- Reinforced negative sampling for recommendation with exposure data. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pp. 2230–2236.
- Collaborative memory network for recommendation systems. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR), pp. 515–524.
- Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1735–1742.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1024–1034.
- LightGCN: simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval (SIGIR), pp. 639–648.
- Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW), pp. 173–182.
- Collaborative metric learning. In Proceedings of the 26th International Conference on World Wide Web (WWW), pp. 193–201.
- Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), pp. 263–272.
- Dual channel hypergraph collaborative filtering. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 2020–2029.
- Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37.
- Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference (WWW), pp. 689–698.
- Disentangled graph convolutional networks. In International Conference on Machine Learning (ICML), pp. 4212–4221.
- Learning disentangled representations for recommendation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5711–5722.
- A deep recurrent collaborative filtering framework for venue recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM), pp. 1429–1438.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119.
- SLIM: sparse linear methods for top-n recommender systems. In IEEE 11th International Conference on Data Mining (ICDM), pp. 497–506.
- BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 452–461.
- RecVAE: A new variational autoencoder for top-n recommendations with implicit feedback. In The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM), pp. 528–536.
- Two decades of recommender systems at amazon.com. IEEE Internet Comput. 21 (3), pp. 12–18.
- NGAT4Rec: neighbor-aware graph attention network for recommendation. arXiv preprint arXiv:2010.12256.
- Embarrassingly shallow autoencoders for sparse data. In The World Wide Web Conference (WWW), pp. 3251–3257.
- A survey of collaborative filtering techniques. Adv. Artif. Intell. 2009, pp. 421425:1–421425:19.
- BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1441–1450.
- A framework for recommending accurate and diverse items using bayesian graph convolutional neural networks. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 2030–2039.
- Neighbor interaction aware graph convolution networks for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 1289–1298.
- Graph attention networks. In International Conference on Learning Representations (ICLR).
- Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 165–174.
- Disentangled graph collaborative filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 1001–1010.
- NPA: neural news recommendation with personalized attention. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 2576–2584.
- Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 726–735.
- HOP-rec: high-order proximity for implicit recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys), pp. 140–144.
- Mixed negative sampling for learning two-tower neural networks in recommendations. In Companion Proceedings of the Web Conference (WWW), pp. 441–447.
- Adaptive semantic-visual tree for hierarchical embeddings. In Proceedings of the 27th ACM International Conference on Multimedia (MM), pp. 2097–2105.
- Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 974–983.
- Graph convolutional network for recommendation with low-pass collaborative filters. In International Conference on Machine Learning (ICML), pp. 10936–10945.
- Sampler design for implicit feedback data by noisy-label robust learning. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 861–870.