Industrial recommender systems usually adopt a multi-stage pipeline, where the first stage, candidate generation, needs to retrieve relevant items from an extremely large candidate set. Traditional collaborative filtering techniques achieved great success in earlier years, while deep learning models have gained popularity recently, owing to their flexibility as general-purpose solutions for a variety of recommendation problems and to their successful deployment in many live systems Covington et al. (2016); Chen et al. (2019); Zhou et al. (2018b, 2019); Ying et al. (2018); Grbovic and Cheng (2018).
Deep candidate generation (DCG for short) models aim to learn representations of both users and items, which can then be used to retrieve the K nearest neighbors (KNN) of a user with a vector-based KNN service, e.g., (Johnson et al., 2017). Among the works in the literature Covington et al. (2016); Grbovic and Cheng (2018); Zhu et al. (2018) that share their experience in building industrial DCG systems, the standard learning methodology treats recommending items to a user as a multiclass classification problem, i.e., assigning a user a target label (item) among all possibilities (the item set). As the label set becomes large, the log-likelihood of the softmax objective is optimized with one of numerous sampling approximation strategies that require explicit sampling of negative examples from a proposal distribution, e.g., NCE Gutmann and Hyvärinen (2010), hierarchical softmax, negative sampling Mikolov et al. (2013) and sampled softmax Bengio and Senécal (2008). All of these stem from the language modeling community, whose tasks may differ in nature from those in recommender systems.
While sampling seems to be a straightforward strategy for handling large-scale data, it is unclear whether explicit sampling for fitting the logged data is the best choice for recommendation tasks in terms of efficiency and effectiveness. In this paper, we propose CLRec, a contrastive learning framework that does not depend on any explicit sampling while achieving better performance in our large-scale live system. Rather than insisting on explicit sampling, CLRec maintains a queue-based buffer that accumulates previously processed examples or representations as negative labels, which are to be discriminated from the positive label of the current instance. This queue-based framework processes the mini-batches continuously and is, in effect, optimizing a contrastive learning objective Oord et al. (2018). We find that this framework has several desirable properties compared to previous works deployed in industrial live systems, including a debiasing effect and engineering efficiency.
The efficiency advantage comes from the fact that we no longer need to sample from an explicit proposal distribution, the choice of which is also reported to be quite sensitive to the dataset in recommendation tasks Caselles-Dupré et al. (2018). Unlike words (labels) in language modeling, which since the development of deep learning usually use no features other than the word id, items usually have abundant feature information, e.g., category id, price, popularity, images, texts and external knowledge, which can be learned and shared from the popular items to the cold ones. Recent DCGs take advantage of various types of features (Covington et al., 2016; Zhou et al., 2019; Ma et al., 2019; Wang et al., 2018b), but mainly on the context side. Few previous works, e.g., Bai et al. (2019), note that feature encoding at the label side can be of great help for highly sparse data. Decoupling training from the sampling of negative examples makes it easier to encode items of rich data types at the label side in a time- and space-efficient way. This more advanced label-side encoder, which was previously treated as per-label parameters when optimizing multi-class objectives, can improve the representation quality of under-exposed long-tail items and therefore the ability to retrieve them.
Despite its simplicity and efficiency, CLRec also yields a large-margin improvement in our online A/B test at Mobile Taobao. In fact, CLRec also has the effect of debiasing item impressions Wang et al. (2016) in the logged data, which contributes to the performance improvement, especially in the live experiment. Compared to approximating the log-likelihood of softmax objectives on the logged training data, the contrastive loss we exploit can be understood as optimizing an objective that approximately corrects the biased impression term. This differs in nature from language modeling, where selection bias is less of a concern.
We summarize our main contributions as follows.
We present CLRec, a contrastive learning framework for large-scale deep candidate generation. We provide several implementation alternatives for the buffer policy and the gradient updating mechanism under the same framework to meet different efficiency requirements.
We analyze and compare CLRec with several popular sampling strategies that directly optimize the softmax gradients, providing theoretical evidence that our method can alleviate selection bias in recommender systems.
We evaluate CLRec on both public and production datasets and conduct online A/B tests in our recommender system, showing a large-margin improvement in performance. We also observe more diversified and fairer recommendations due to the debiasing ability of the model.
2.1 Problem Setting of Deep Candidate Generation Approaches
We are given a dataset of user behavior sequences, where each sequence represents a user's historical interaction with the system and has its own length. Each element of a sequence is an item consisting of various kinds of features, drawn from the entire item candidate set. Note that the features of an item could be in any form, such as ids, values, pretrained embeddings, or raw input such as images and texts.
Deep candidate generation approaches learn a user encoding function and an item encoding function so that, given any user at test time, the model predicts an item set according to a scoring function computed from the two encodings. Both encoders are neural networks. We use the subscripts q and k on their parameters to denote query and key, terms that are used in later sections, and refer to the two parameter sets jointly for convenience.
2.2 Sampling Approximations of Maximum Likelihood Estimation
To learn such encoders, a self-supervised model is adopted by splitting the dataset, e.g., pairing a sub-sequence with the remaining items in the sequence as positive examples. Such splitting can take the form of an autoregressive task Zhou et al. (2018a); Kang and McAuley (2018); Zhou et al. (2019), a cloze test Sun et al. (2019), or an autoencoder Liang et al. (2018). Supervision is then provided by estimating the conditional probability of interacting with an item given a user. This is usually represented as a multinomial distribution (Covington et al., 2016).
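For concreteness, the multinomial distribution mentioned above is commonly written as a softmax over the item set. The sketch below uses assumed notation that may differ from the paper's own: $x$ is the user's behavior sequence, $y$ an item, $\mathcal{Y}$ the item set, and $\phi_\theta(x, y)$ the score computed from the user and item encodings.

```latex
p_\theta(y \mid x) = \frac{\exp\big(\phi_\theta(x, y)\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(\phi_\theta(x, y')\big)}
```

The sum in the denominator ranges over every item, which is what makes exact maximum likelihood expensive when the item set is large.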
Maximizing the log-likelihood of this distribution involves time-consuming computation over all possible user-item pairs. Popular approximation methods, including NCE Gutmann and Hyvärinen (2010), negative sampling Mikolov et al. (2013) and sampled softmax Bengio and Senécal (2008); Jean et al. (2014), were developed to approximate its gradients. Methods like NCE and negative sampling convert the multi-class objective into a binary cross entropy loss by discriminating positive pairs in the dataset from negative pairs sampled via a proposal distribution, treating each sample pair individually. However, multiple works report problems when applying these methods to large vocabularies Covington et al. (2016); Jean et al. (2014). They instead employ a sampled softmax technique Bengio and Senécal (2008), based on weighted importance sampling, to approximate the softmax gradient.
Here the negative examples are drawn from a proposal distribution. To gain further efficiency in practice, the conditional proposal distribution is replaced by an unconditional one, discarding its dependence on the user. Performance is quite sensitive to the choice of proposal distribution, which is hard to make in practice Caselles-Dupré et al. (2018). Sampling also incurs non-negligible overhead, especially in a distributed environment Stergiou et al. (2017). To avoid inefficient sampling of high-dimensional item features, DCGs with sampled softmax employ item features in the query-side user encoder while ignoring them at the label side, which then only serves as an embedding lookup for the item index.
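As a rough illustration of the sampled softmax approximation described above, the following sketch scores one positive item against negatives drawn from a proposal distribution and subtracts the log proposal probability from each logit. The function name `sampled_softmax_loss`, its argument names and the toy numbers are assumptions made for illustration; they are not taken from any of the cited systems.

```python
import math

def sampled_softmax_loss(pos_score, pos_q, neg_scores, neg_qs):
    """Negative log-likelihood of the positive item under a sampled softmax.

    pos_score / neg_scores: model scores for the positive item and for the
    sampled negatives. pos_q / neg_qs: their probabilities under the proposal
    distribution (hypothetical inputs supplied by the caller).
    """
    # Correct every logit by the log proposal probability (importance weighting).
    logits = [pos_score - math.log(pos_q)]
    logits += [s - math.log(q) for s, q in zip(neg_scores, neg_qs)]
    # Numerically stable log-sum-exp over the positive and the sampled negatives.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

loss = sampled_softmax_loss(2.0, 0.1, [0.5, -1.0], [0.3, 0.05])
```

Note that the loss depends on the proposal probabilities of every sampled item, which is one way to see why the choice of proposal distribution matters.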
In this section, we introduce CLRec, a contrastive learning framework that generates candidates from an extremely large candidate set for practical recommender systems, and we show theoretically that CLRec alleviates selection bias.
3.1 CLRec : a Contrastive Learning Framework for Deep Candidate Generation
We consider the paired training dataset derived from the original sequence set, which contains all the positive pairs of queries and keys. A query or key here could be any object, such as a user, an item, a search query or another entity in the recommender system. Queries and keys could also be of the same type; for example, we can learn a user-user similarity function with CLRec when both represent the same user, as discussed in Section 3.2.3.
CLRec contains a streaming buffer and a parameter updating strategy in addition to the standard components of a DCG model, i.e., a query encoder, a key encoder and a non-parametric similarity function. The streaming buffer maintains a queue of fixed capacity that holds the raw input data or the representations of previous keys. It follows a First-In-First-Out (FIFO) replacement strategy to update the queue when the keys of a new batch arrive. The updating policy controls the gradient updates of each encoder.
We illustrate the general framework in Algorithm 1. Given the current batch of examples, the buffer dequeues the least recent entries and enqueues the input data or the encoded representations of the keys in the batch, depending on its implementation. CLRec then optimizes the objective in Equation 3 over the keys in the queue and the queries in the batch, where the similarity function is a cosine similarity with a temperature as its hyper-parameter.
This choice of similarity function is related to recommendation diversity, which we discuss in the ablation study of the encoder structure. The updating policy first conducts the forward computation of the loss according to Equation 3; depending on the implementation, the key representations are either computed from scratch or reused from the representations cached in the buffer. It then calculates the gradients of the query encoder and updates it via back propagation. The key encoder is then updated either directly via back propagation or indirectly from the query encoder (He et al., 2019). Different choices of the buffer and the updating policy result in different implementations of CLRec, as described in Sections 3.2.1, 3.2.2 and 3.2.3.
Equation 3 has a surprisingly familiar and simple form: it optimizes a softmax cross entropy over the keys in the queue only. This loss is adopted in (Sohn, 2016) and is also known as InfoNCE (Oord et al., 2018). Optimizing it is proved to be equivalent to maximizing a lower bound on the mutual information between queries and keys (Oord et al., 2018). Compared to the traditional approaches illustrated in Equation 2, CLRec has no correction term for the proposal distribution; we show in Section 3.3 that this gives it the ability to alleviate data bias.
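A minimal sketch of the softmax cross entropy described above, assuming cosine similarity scaled by a temperature and a list `positive_index` that records which key in the queue is the positive for each query. The function names, the default temperature of 0.1 and the toy vectors are illustrative assumptions, not values from the production system.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(queries, keys, positive_index, temperature=0.1):
    """Mean softmax cross entropy of each query against all keys in the queue.

    queries[i] is a query embedding, keys holds the key embeddings in the
    queue, and positive_index[i] is the position of query i's positive key.
    """
    total = 0.0
    for q, pos in zip(queries, positive_index):
        logits = [cosine(q, k) / temperature for k in keys]
        # Stable log-sum-exp: every other key in the queue acts as a negative.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[pos]
    return total / len(queries)

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
aligned = contrastive_loss(queries, keys, [0, 1])
swapped = contrastive_loss(queries, keys, [1, 0])
```

The loss is small when each query is closest to its own positive key (`aligned`) and large when the positives are mismatched (`swapped`). No proposal probabilities enter the computation.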
Buffer-Based vs. Sampling-Based Method. CLRec uses a buffered queue to keep previous examples as negatives. Compared to existing sampling-based methods, CLRec does not run any sampling procedure to explicitly generate negative samples. Instead, it implicitly treats the other keys in the queue as negatives for each query. It needs no extra effort to deal with large-scale negative sampling: no preprocessing phase, no extra memory and no remote communication. Compared to the traditional sampled softmax approach, CLRec reduces the memory requirement; alternatively, we can enlarge the number of samples within a batch under the same memory capacity, thereby reducing the variance of the sampled loss. In addition, it removes the sensitive choice of the proposal distribution, and our methods perform stably in practice. The buffer-based design also gives us an opportunity to efficiently use more complex features when encoding the candidate set, especially in a distributed environment, where sampling negative candidates together with their features is costly in both time and space.
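The FIFO behavior of the streaming buffer can be sketched with a bounded double-ended queue. The class name `StreamingBuffer`, its methods and the item names are hypothetical and only illustrate the replacement strategy, not the deployed implementation.

```python
from collections import deque

class StreamingBuffer:
    """FIFO queue of keys; the oldest entries drop out once capacity is reached."""

    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)

    def enqueue(self, batch_keys):
        # deque(maxlen=...) discards the least recent entries automatically.
        self.queue.extend(batch_keys)

    def keys(self):
        return list(self.queue)

buffer = StreamingBuffer(capacity=4)
buffer.enqueue(["item_a", "item_b", "item_c"])
buffer.enqueue(["item_d", "item_e"])
queued_keys = buffer.keys()
```

After the second batch arrives, the oldest key is evicted and the queue holds the four most recent keys. Whether the entries are raw inputs or cached representations is left to the specific implementation.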
3.2 CLRec Implementations
Next, we propose three implementations of the CLRec framework, CLRec-intra, CLRec-inter and CLRec-moco, which are increasingly efficient at encoding keys with richer data types. To summarize, CLRec-intra is simple yet effective if we do not encode features at the label side; CLRec-inter supports larger batches of keys and lower variance under the same memory budget; and CLRec-moco avoids both forward encoding and backward gradients for keys in the buffer, so it can be used to encode more complex data forms such as images, texts or external knowledge.
3.2.1 Intra-Batch Contrastive Learning
We first introduce a basic implementation of CLRec called Intra-Batch Contrastive Learning, whose strategy for obtaining negative examples is also explored in Hidasi et al. (2015); Chen et al. (2017), but with a binary loss. The streaming buffer has the same capacity as the batch size. It receives a batch of data and simply replaces its keys with those of the current batch. The buffer's enqueue_fn() keeps only the input data of the keys and does not hold any buffered encodings. The updating policy calculates all key representations from scratch. It then directly updates each encoder with its current gradient in the regular way via back propagation.
The intra-batch implementation of CLRec has the same number of queries and keys and needs to encode both from scratch. It therefore still restricts the number of keys in a batch because of limited GPU memory, especially when the input data is much larger than the learnt representations.
3.2.2 Inter-Batch Contrastive Learning
As a larger number of keys can reduce the sampling variance, and this number is suggested to be large enough with respect to the number of items, we propose another implementation called inter-batch contrastive learning.
In inter-batch contrastive learning, the streaming buffer holds more than one batch of keys in the queue. As shown in Figure 1(b), maintaining the queue in fact creates a sliding window of negative samples collected from earlier batches. For the current batch of queries, all the inputs remaining in the queue are regarded as negative keys. When computing the contrastive loss, all the representations in the queue are recomputed with the most recently updated key encoder. Intra-batch contrastive learning is thus a special case of inter-batch contrastive learning in which the queue size equals the batch size.
This further enlarges the number of candidates, because the number of keys is now decoupled from the number of queries, whose encoder usually needs to be more complex due to nested input such as behavior sequences.
3.2.3 Momentum Contrast (MOCO)
If the input data of the keys are also of rich data types, e.g., if items come with text or image input, encoding the items in the queue from scratch is both time-consuming and memory-inefficient, especially during back propagation. Inspired by Momentum Contrast (MOCO) He et al. (2019), we provide an alternative that caches the representations of the keys rather than their input data, and keeps updating the key encoder without calculating its gradients, as illustrated in Figure 1(c).
In MOCO, the trainable parameters of the key encoder need to have the same structure as those of the query encoder, as it deals with homogeneous data. In our case, we only require that every parameter in the key encoder has a counterpart in the query encoder that has the same shape and serves the same purpose. (If the query represents a user and the key represents an item, this means that the item encoding networks should have the same architecture across the two encoders.)
Given a batch of data, CLRec-moco first calculates the key representations using the key encoder and enqueues these representations rather than the data. Keys already in the buffer do not involve any forward or backward computation. Instead, only the gradients of the query encoder are calculated, and the query encoder is updated by regular back propagation. As all the trainable parameters in the key encoder have counterparts in the query encoder, the key encoder can be updated indirectly by following the query encoder's trajectory. In other words, the key encoder's parameters are updated as a slower-moving version of their counterparts in the query encoder, governed by a momentum coefficient.
In the extreme case when the momentum is 0, the key encoder becomes the most recently updated query encoder, which is the same as the standard memory bank (Wu et al., 2018a). In practice, we may need to set the momentum close to 1 if the encoder is very complex, e.g., an extremely deep image encoder as demonstrated in He et al. (2019), to enable a much slower and more consistent update. This makes updates of the key encoder more stable and the training process less prone to noisy or disturbing samples, at the cost of slower convergence.
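In the usual Momentum Contrast formulation, each key-encoder parameter becomes m times its old value plus (1 - m) times the matching query-encoder parameter. The sketch below assumes CLRec-moco follows that form, which is consistent with the two limiting cases described above; the dictionary layout and the parameter name `item_embedding` are hypothetical, and scalars stand in for tensors.

```python
def momentum_update(key_params, query_params, m):
    """Move each key-encoder parameter toward its query-encoder counterpart."""
    return {
        name: m * key_params[name] + (1.0 - m) * query_params[name]
        for name in key_params
    }

key_params = {"item_embedding": 0.0}
query_params = {"item_embedding": 1.0}
slow = momentum_update(key_params, query_params, m=0.999)   # barely moves
copied = momentum_update(key_params, query_params, m=0.0)   # copies the query side
```

With `m=0.0` the key parameters simply copy the query parameters, while a value close to 1 changes them only slightly per step.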
CLRec-moco provides opportunities to efficiently implement advanced encoders on keys with rich types of features. With the help of CLRec-moco, we investigate another style of self-supervised training, which we call the user-to-user pretext task. Details about the implementation and results can be found in Section C.4.
3.3 Analysis of the Debias Effects of CLRec
In this subsection, we first illustrate that DCG methods usually suffer from selection bias, also known as Missing Not At Random (MNAR) (Little and Rubin, 2019; Schnabel et al., 2016), as the training data in recommender systems are biased. We then analyze theoretically how CLRec helps alleviate it. The general idea is that the proposal distribution removed from Equation 2 in fact works as an approximation of the correction term in Inverse Propensity Weighting (Wang et al., 2016), which corrects the biased data distribution. We demonstrate both empirically and theoretically that this approximation is sufficient.
To be more specific, DCG methods try to fit the training data distribution directly, i.e., by minimizing the cross entropy between the data distribution and the model distribution,
where the data distribution is the distribution of logged user clicks, which is affected by the recommendation lists produced by the previous system and is therefore not uniformly random. It is thus a biased distribution with respect to the one that reveals the true intentions of users.
To obtain an unbiased estimate, Inverse Propensity Weighting Wang et al. (2016) proposes to optimize the following weighted loss instead,
where the weight involves the impression distribution, which is the summarized effect of all the models deployed online and represents the probability of the user observing the item in the previous system.
However, in practice the impression distribution is extremely challenging to estimate. First, its exact form is the combined result of not only the last version of the DCG model, but also the downstream ranking models and many ad-hoc rules. Second, even if we try to estimate it by fitting the logged impression data, the estimate will very likely be close to zero for the majority of user-item pairs due to the large scale of our system, where the numbers of users and items can both easily exceed 100 million. This second issue can bring undesirable numerical instability, since we weight the loss by the inverse of the impression probability.
We therefore propose to approximate the impression distribution by the frequency with which an item is exposed by the previous version of the system to any user. This approximation, while conceptually simple, is demonstrated to be effective in practice, as shown in Section 4.2.3. Based on this approximation, our training loss becomes Equation 8.
We now show that our contrastive loss is equivalent to minimizing the loss stated in Equation 8.
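One way to write the three objectives discussed in this subsection is sketched below under assumed notation: $p_{\text{data}}(y \mid x)$ is the logged click distribution, $p_\theta(y \mid x)$ the model distribution, $q(y \mid x)$ the impression distribution and $q(y)$ the item exposure frequency that approximates it. The symbols and the normalization are assumptions rather than a transcription of Equations 6 to 8.

```latex
\begin{align}
\mathcal{L}_{\text{fit}} &= -\sum_{x}\sum_{y} p_{\text{data}}(y \mid x)\,\log p_\theta(y \mid x) \\
\mathcal{L}_{\text{IPW}} &= -\sum_{x}\sum_{y} \frac{p_{\text{data}}(y \mid x)}{q(y \mid x)}\,\log p_\theta(y \mid x) \\
\mathcal{L}_{\text{approx}} &= -\sum_{x}\sum_{y} \frac{p_{\text{data}}(y \mid x)}{q(y)}\,\log p_\theta(y \mid x)
\end{align}
```

The only change between the last two lines is that the per-user propensity is replaced by a per-item exposure frequency, which is far easier to estimate and much less likely to be close to zero.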
We now focus on one single training batch. Consider the multiset of items that appear in the present fixed-size batch, in which the same item may appear multiple times. We then focus on one user in this batch and the positive item associated with that user.
The key insight is that, in every batch, we are actually minimizing the cross entropy between the sampled data distribution and the (un-normalized) distribution represented by the model. We would like to emphasize that (1) the support of these two distributions is the set of items in the batch rather than the entire item set, and (2) the cross entropy loss, which is equivalent to the KL divergence up to a term that does not depend on the model, is asymmetric, so the direction matters.
According to Bayes' rule,
Here the probability that an item appears in the batch is in fact easy to compute, once we notice that, for an item to appear in the batch, it should either be the true label (i.e., the positive item) or be sampled as a negative item from some distribution over items. (Here we have skipped the proof that CLRec is in fact equivalent to sampling the negative items according to that distribution.) Therefore,
Based on the above results, we can see that the contrastive loss in fact minimizes the cross entropy between the (un-normalized) data distribution and the (un-normalized) distribution represented by the model, which is equivalent to minimizing the loss stated in Equation 8.
We first show that the CLRec framework helps improve the performance of existing DCGs on public datasets. We then focus on large-scale offline and online experiments in our production system, showing that CLRec can improve both accuracy and fairness in large-scale recommender systems.
4.1 Experiments on Public Dataset
We evaluate on three public datasets, whose statistics are in Table A1 in the Appendix.
Competitors and Metrics. We choose two representative DCG methods to demonstrate that the CLRec framework can improve performance regardless of the entity encoders. SASRec Kang and McAuley (2018) is a self-attentive model that predicts the next item given all the previous behaviors, trained with a negative sampling loss formulated as a binary classification problem. BERT4Rec Sun et al. (2019) is a bidirectional model that predicts masked items from the behaviors on both sides, using the full softmax cross entropy. We evaluate CLRec and the competitors with hit rate, NDCG and MRR as in Sun et al. (2019) on the public datasets and list the results in Table 1.
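The three ranking metrics mentioned above can be sketched as follows, assuming a next-item protocol with exactly one held-out target item per user. The function names and toy scores are illustrative, and the exact evaluation protocol of the cited work may differ in details such as how candidates are chosen.

```python
import math

def rank_of_target(scores, target):
    """1-based rank of the target item when items are sorted by descending score."""
    return 1 + sum(1 for s in scores.values() if s > scores[target])

def hit_rate_at_k(rank, k):
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    # With a single relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1).
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def reciprocal_rank(rank):
    return 1.0 / rank

scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}
rank = rank_of_target(scores, "b")
```

Averaging each quantity over all test users gives the reported hit rate, NDCG and MRR.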
As summarized in Table 1, on relatively small-scale datasets, SASRec with a negative sampling loss behaves similarly to BERT4Rec, which uses the full softmax. CLRec adopts intra-batch contrastive learning and uses the encoders of SASRec. It achieves relatively significant performance gains, especially as the candidate set becomes larger.
|Method||CTR||User Duration Time||Popularity Index of Recommended Items|
|MIND||5.87%||-||0.658 (a high value indicates a tendency to recommend popular items)|

|negative sampling Mikolov et al. (2013)||0.071||-|
|shared negative sampling Ying et al. (2018)||0.064||-|
|sampled-softmax Jean et al. (2014)||0.176||3.32%|

|Impl. Sup. Type||pred1-hr@50||pred5-hr@50|

|sampled-softmax + label-feat.||0.13m||1100|
|CLRec-inter + label-feat.||0.28m||500|
|CLRec-intra + label-feat.||0.28m||500|

|WHA + cosine||0.197||3.21%|
4.2 Online and Offline Experiments in Production
CLRec has been successfully deployed on Mobile Taobao, a world-leading e-commerce platform and the largest online shopping app in China. Each day, the recommender system in our scenario has to deal with hundreds of millions of page views.
4.2.1 Implementations of User and Item Encoders
Our user encoder is a simple multi-head self-attention neural network that encodes the user behavior sequence consisting of all the items the user has interacted with. An extra weighted heads aggregation layer He et al. (2017), denoted as WHA, is then added on top of the head embeddings. The item encoder is a sum of a reversed positional embedding, a relative time bucket embedding, an item id embedding and the overall embedding of other features, which can be shared between the query side and the label side. Detailed implementations are given in the Appendix.
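The additive structure of the item encoder can be sketched as an element-wise sum of looked-up vectors. The table names, the embedding dimension of 2 and the values below are hypothetical, and the overall embedding of other features is omitted for brevity.

```python
def sum_embeddings(vectors):
    """Element-wise sum of equally sized embedding vectors."""
    return [sum(components) for components in zip(*vectors)]

def encode_behavior(item_id, reversed_position, time_bucket, tables):
    # Look up one vector per field and add them, so all fields share one space.
    return sum_embeddings([
        tables["item_id"][item_id],
        tables["reversed_position"][reversed_position],
        tables["time_bucket"][time_bucket],
    ])

tables = {
    "item_id": {42: [0.5, 0.0]},
    "reversed_position": {0: [0.0, 0.25]},
    "time_bucket": {3: [0.25, 0.25]},
}
encoded = encode_behavior(42, 0, 3, tables)
```

Because the item id table is an ordinary lookup, the same table can in principle be reused when encoding items on the label side.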
4.2.2 Deployment Details
The total number of items is over 100 million, and we choose 7 categorical features including item id, root category id, leaf category id, seller id, seller type, brand id and gender preference among all the features of an item to encode. Note that, CLRec can potentially handle more complex features, e.g., images, texts and external knowledge, which we leave for future work.
Predicting the next item of a short sequence can easily introduce abundant noise that harms performance, so we require a valid sequence to have a length of at least 5. We use the data collected over the last 4 days as the training set and the click data logged on the following day as the test set. The training samples contain 4 billion user behavior sequences, and we train the model on a distributed TensorFlow cluster of 140 workers and 10 parameter servers.
Offline Evaluation. As the candidate generation phase aims at retrieving relevant items, we use hr@N as our offline metric for DCG. Note that we approximately generate the top-k candidates from the entire item set using a distributed KNN library developed on top of Johnson et al. (2017), rather than calculating the exact KNN over a relatively small, randomly sampled candidate set. We also find this metric to be consistent between online and offline performance.
Online Serving. We store and index the trained item embeddings using an online vector-based KNN service for nearest neighbor search. We infer a user embedding when a request from the user arrives. We then conduct a KNN search and retrieve the most relevant item set; the query latency is around 10ms. A separate ranking model is then responsible for producing the final results from the retrieved set.
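The retrieval step can be illustrated with an exact brute-force search over a toy catalogue. This is only a stand-in for the approximate, distributed KNN service used in production; the function names, item names and vectors are assumptions made for the example.

```python
import heapq
import math

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def top_k_items(user_embedding, item_embeddings, k):
    """Exact top-k by cosine similarity; a stand-in for an approximate KNN index."""
    u = normalize(user_embedding)
    scored = (
        (sum(a * b for a, b in zip(u, normalize(vec))), item_id)
        for item_id, vec in item_embeddings.items()
    )
    # nlargest returns (score, item_id) pairs in descending order of score.
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

item_embeddings = {
    "tea": [1.0, 0.1],
    "cat_food": [0.1, 1.0],
    "soy_sauce": [0.8, 0.3],
}
candidates = top_k_items([1.0, 0.0], item_embeddings, k=2)
```

At production scale the item vectors would be indexed once offline, and only the user embedding is computed at request time.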
Online A/B Test. CLRec has been fully deployed in Mobile Taobao since Jan 2020, and it consistently outperforms the previous state-of-the-art baselines, which we now keep as roll-back strategies. We also report the ablation studies on encoder structures conducted during our model iterations, each of which was assigned a minimum share of traffic and observed for at least 1 day.
We consider click-through rate (ctr) as the ground-truth metric. Note that, considering the cost of exploration in live systems, we deploy a model online only if its performance is at least similar to that of the previous baselines. If a method performs very poorly offline, or differs only marginally from methods that perform better offline, we do not serve it online. In such cases, we mark its online performance as “-” in the table.
We first compare CLRec with other sampling alternatives adopted in large-scale industrial recommender systems, observing a large-margin improvement in both offline hit rate and online ctr. We also experiment with a user-to-user pretext task with the help of CLRec-moco, revealing that this new type of self-supervision can in fact improve performance. We then investigate the debiasing effect of CLRec, revealing insights into the source of the performance gain. Studies of the label-side feature encoders further highlight the engineering advantages of avoiding explicit sampling. In addition, we conduct an ablation study of our encoder implementation, along with a case study showing its ability to produce diversified recommendation results. As the offline and online evaluations were conducted on different days, numbers reported across tables may not be comparable.
Comparison with other Sampling-Based Alternatives. Detailed implementations of all alternatives of loss and sampler that we compare can be found in Section C.3 in the Appendix.
In Table 3, we first observe that the negative sampling approaches perform rather poorly, indicating that the binary classification loss does not scale well to such large item sets. We see a small improvement in offline hit rate for CLRec over sampled-softmax; however, both CLRec-inter and CLRec-intra achieve significant improvements in online ctr. This indicates a discrepancy between offline and online improvements for candidate generation, which may stem from the bias in the training data. Note that, with the same number of keys, CLRec-inter uses fewer queries than CLRec-intra.
Offline Evaluation of user2user Self-Supervision under CLRec-moco. CLRec-moco allows efficient training even if the keys are of rich data types. Here we demonstrate that MOCO can help us implement user-user training efficiently, which improves the offline results. We train our model by adding another user-user prediction loss, where a query is a user’s earlier sequence while the key is the user’s future sequence of five clicks. The key encoder has the same architecture as the query encoder for encoding click sequences. Training details can be found in Section C.4 in the Appendix. We also note that the original user-item prediction loss is still necessary and is used along with the user-user prediction loss.
The offline results are shown in Table 4. We can see that the user-user training strategy improves the model's ability to make long-term predictions, i.e., to accurately predict the next five clicks instead of just the next one. However, we observe that momentum is not necessary, because (1) our model is relatively shallow compared to the deep models used for encoding images He et al. (2019) and (2) we reuse the same item embedding table, which serves as a consistent dictionary shared by both the key and the query encoders.
Effects of Debiasing. To examine the debiasing effect of CLRec, we summarize the aggregated diversity (Adomavicius and Kwon, 2011), i.e., the number of distinct items across all recommended results, in Table 5, and report the resulting impression distributions of the recommended items in Figure 2. The differences between the implementations of CLRec are minor, so we only report CLRec-intra here.
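Aggregated diversity as defined above is a count of distinct items over all users' recommendation lists. A minimal sketch, with a hypothetical function name and toy lists:

```python
def aggregated_diversity(recommendation_lists):
    """Number of distinct items across the recommendation lists of all users."""
    distinct = set()
    for items in recommendation_lists:
        distinct.update(items)
    return len(distinct)

recs = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "b"]]
diversity = aggregated_diversity(recs)
```

A model that keeps recommending the same popular items to everyone scores low on this metric even if each individual list looks reasonable.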
Table 5 shows that CLRec improves aggregated diversity by more than 2x. We can also see from Figure 2 that the sampled softmax strategy fits the impression distribution of the training data, selection bias included, much more closely. CLRec, in contrast, learns a quite different distribution that shifts towards the under-exposed items. These results suggest that the proposed contrastive learning approach leads to fairer recommendations for under-exposed items and counteracts the “rich get richer” phenomenon. Given that the sampled-softmax approach uses the same encoder, we may conclude that the large performance gain is caused by the debiasing effect.
Label-side Feature Encoding. We study the effect of label-side feature encoding. Table 6 reveals that a label-side feature encoder benefits performance, highlighting the importance of CLRec, which enables us to explore more advanced features at the label side.
Engineering Efficiency. As shown in Table 8, CLRec is much more time-efficient than the sampling-based method. Currently we only use small item feature inputs, such as item id and category id, which amount to only 7 integer values. It can be inferred that the efficiency gap will be much wider once we encode more complex item features. Note that sampled-softmax with features is implemented with a distributed in-memory storage service provided by (Zhu et al., 2019). CLRec with label-side features does not incur extra parameter updates, since these features are already used at the query side of CLRec.
Ablation Study of the Encoder Structure in CLRec. We conduct ablation studies across different time periods because of iterative model development. Table 7 and Table 9 illustrate that the encoders benefit from the temperature-cosine similarity measurement, WHA, time bucket encoding and reversed positional encoding. As can be seen in Table 9, cosine similarity is needed to reach a higher hit rate in our production system. We suspect that the improvement comes from the varying nature of users' interests in practice. We find that previous methods, especially those using the dot product, easily lead to a much more skewed interest distribution in the top-k retrieved candidates, leaving many items of possible interest under-exposed. Many works have also noted this problem and proposed ways to solve it, such as using multiple interest vectors Ma et al. (2019); Li et al. (2019) or multiple interest sub-models Zhu et al. (2018). To further verify our conjecture, we illustrate the items retrieved by different encoders in Figure 3. Among the diverse interests in the user behavior sequence in Figure 3, i.e., spirit, cat and condiment, the encoders without the WHA layer and the temperature-cosine similarity retrieve cats only, while the results of our encoder cover all these interests well.
5 Related Work
Large scale deep candidate generation approaches. Deep candidate generation approaches are widely applied in industrial recommender systems. Covington et al. (2016) first introduce a DNN-based candidate generation method on a million-size item set Joglekar et al. (2019). Attempts on larger-scale candidate sets have also been made, from homogeneous item graph embedding Zhou et al. (2017); Wang et al. (2018c) to sequential models Li et al. (2019); Zhu et al. (2018). These works usually optimize the log-likelihood of softmax by either negative sampling (Mikolov et al., 2013) or sampled-softmax (Jean et al., 2014; Bengio and Senécal, 2008). Ying et al. (2018) employ a graph convolutional network to conduct recommendation on a pin-board bipartite graph using a max-margin-based loss. Caselles-Dupré et al. (2018) evaluate different proposal distributions on various recommendation datasets, showing an undesirable sensitivity to the choice of proposal distribution for negative sampling in recommender systems. Debiasing is considered in (Chen et al., 2019), which requires a relatively complex reinforcement learning pipeline, or in (Ai et al., 2018), which jointly learns an unbiased ranker and an unbiased propensity model.
Contrastive learning. Recently, another line of research has emerged that focuses on self-supervised pre-trained models. These works propose contrastive learning methods Wu et al. (2018b) and achieve many successes in audio (Oord et al., 2018), computer vision (Hjelm et al., 2018; He et al., 2019), graph neural networks Veličković et al. (2018), and world models (Kipf et al., 2019). The contrastive loss used in our paper is also used in (Sohn, 2016), and is referred to as InfoNCE in (Oord et al., 2018). Minimizing this loss is proved to be equivalent to maximizing a lower bound of the mutual information of the two input variables (Oord et al., 2018).
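To make the form of this contrastive loss concrete, the following is a minimal sketch, assuming dot-product scores between one query and a set of candidate keys of which exactly one is the positive. The function name, the shapes, and the `temperature` argument are illustrative assumptions and are not taken from our production code.

```python
import numpy as np

def info_nce_loss(query, keys, positive_index=0, temperature=1.0):
    """Negative log-softmax of the positive key's score among all keys."""
    logits = keys @ query / temperature            # one score per candidate key
    logits = logits - logits.max()                 # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_index]

rng = np.random.default_rng(0)
query = rng.normal(size=8)
keys = rng.normal(size=(5, 8))                     # 1 positive + 4 negatives (toy sizes)
keys[0] = query                                    # make key 0 the positive
loss = info_nce_loss(query, keys)
```

When all candidates score equally, the loss equals the logarithm of the number of candidates, which is the value of an uninformative model.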
Implicit Sampling Strategies. Sharing sampled negative examples has been explored in the recommendation literature Hidasi et al. (2015); Ying et al. (2018); Chen et al. (2017). Ying et al. (2018) propose to share negative examples sampled from the entire candidate set, while Chen et al. (2017) propose a similar intra-batch sharing method. However, all these works optimize a binary cross entropy, which shows poor performance at our problem scale.
We propose a contrastive learning framework for deep candidate generation in recommender systems, which does not require explicit sampling during training. It achieves high efficiency while showing better performance in our live experiment, owing to its debiasing effect. It has great potential to incorporate item features of complex data types in an extremely efficient way, or user-pair self-supervision, which we will further investigate in future work.
- Improving aggregate recommendation diversity using ranking-based techniques. IEEE Transactions on Knowledge and Data Engineering 24 (5), pp. 896–911. Cited by: §4.2.3, Table 5.
- Unbiased learning to rank with unbiased propensity estimation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 385–394. Cited by: §5.
- Personalized bundle list recommendation. In The World Wide Web Conference, pp. 60–71. Cited by: §1.
- Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks 19 (4), pp. 713–722. Cited by: §1, §2.2, §5.
- Word2vec applied to recommendation: hyperparameters matter. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 352–356. Cited by: §1, §2.2, §5.
- Top-k off-policy correction for a reinforce recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 456–464. Cited by: §1, §5.
- On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 767–776. Cited by: §3.2.1, §5.
- Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198. Cited by: Appendix B, Contrastive Learning for Debiased Candidate Generation at Scale, §1, §1, §1, §2.2, §2.2, §5.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: Appendix B.
- Real-time personalization using embeddings for search ranking at airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 311–320. Cited by: §1, §1.
- Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §1, §2.2.
- Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §3.1, §3.2.3, §3.2.3, §4.2.3, §5.
- An unsupervised neural attention model for aspect extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 388–397. Cited by: Appendix B, §4.2.1.
- Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939. Cited by: §3.2.1, §5.
- Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §5.
- On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007. Cited by: 3rd item, §2.2, Table 3, §5.
- Neural input search for large scale recommendation models. arXiv preprint arXiv:1907.04471. Cited by: §5.
- Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §1, §4.2.2.
- Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206. Cited by: §2.2, §4.1.
- Contrastive learning of structured world models. arXiv preprint arXiv:1911.12247. Cited by: §5.
- Multi-interest network with dynamic routing for recommendation at tmall. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2615–2623. Cited by: §4.2.3, Table 2, §5.
- Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pp. 689–698. Cited by: §2.2.
- A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: Appendix B.
- Statistical analysis with missing data. Vol. 793, John Wiley & Sons. Cited by: §3.3.
- Learning disentangled representations for recommendation. In Advances in Neural Information Processing Systems, pp. 5712–5723. Cited by: Appendix B, §1, §4.2.3.
- Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: 1st item, §1, §2.2, Table 3, §5.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §3.1, §5.
- Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352. Cited by: §3.3.
- Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §3.1, §5.
- Distributed negative sampling for word embeddings. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: 1st item, §2.2.
- BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1441–1450. Cited by: §2.2, §4.1.
- Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: Appendix B.
- Deep graph infomax. arXiv preprint arXiv:1809.10341. Cited by: §5.
- Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274. Cited by: Appendix B.
- Ripplenet: propagating user preferences on the knowledge graph for recommender systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 417–426. Cited by: §1.
- Billion-scale commodity embedding for e-commerce recommendation in alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 839–848. Cited by: §5.
- Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 115–124. Cited by: §1, §3.3, §3.3.
- Deep cosine metric learning for person re-identification. In 2018 IEEE winter conference on applications of computer vision (WACV), pp. 748–756. Cited by: Appendix B.
- Improving generalization via scalable neighborhood component analysis. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 685–701. Cited by: §3.2.3.
- Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §5.
- Graph convolutional neural networks for web-scale recommender systems. arXiv preprint arXiv:1806.01973. Cited by: 2nd item, §1, Table 3, §5, §5.
- ATRank: an attention-based user behavior modeling framework for recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Appendix B, Appendix B, §2.2.
- Scalable graph embedding for asymmetric proximity. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §5.
- Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5941–5948. Cited by: §1, §1, §2.2.
- Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068. Cited by: §1.
- Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1079–1088. Cited by: §1, §4.2.3, §5.
- AliGraph: a comprehensive graph neural network platform. Proceedings of the VLDB Endowment 12 (12), pp. 2094–2105. Cited by: 1st item, §4.2.3.
Appendix A Discussion About Debias
As and are both impression distributions, we are actually replacing the IPW term with , which is the overall click through rate of an item , under the impression policy . We have
From the above equations, we can see that, if the user behavior is random, i.e., click an item randomly when it’s recommended, then we have . If we consider the to be category-specific value, i.e., where c is the category of y, then
if category is a popular category e.g., clothes, it may be over-punished.
Appendix B Implementations of Encoders
Now we share the detailed implementations of our user/item encoders. On a comprehensive e-commerce platform, users usually have diversified interests, from food to clothes, to meet different daily demands, leading to diversified behavior sequences. We employ a multi-head self-attention module as the basis of the sequence encoder to capture the different user interests revealed by the input sequence, and we apply a weighted head aggregation, denoted as WHA, on top of the representations of each head, which is the same as in (He et al., 2017). The overall sequence encoder is illustrated as follows,
Next we will introduce the design choice of each part.
Item Encoder consists of four parts: a basic item embedding according to its unique identity, a time bucket embedding, a reverse positional embedding, and a concatenated item feature embedding.
Time Bucket Embedding. The time of any user behavior is critical for an online recommendation system, and it is not trivial for an LSTM to encode different time intervals between consecutive behaviors. We use the time bucketized embedding proposed in Zhou et al. (2018a), i.e., given the current clock time , the relative behavior time of each behavior can be calculated as which locates in the bin . Each time bucket has an individual representation.
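As an illustration of time bucketing, the following is a minimal sketch that maps the elapsed time of a behavior to a bucket id. The log2-scale bin boundaries, the unit of seconds, and the parameter `num_buckets` are hypothetical choices made for this example; the actual bin definition follows Zhou et al. (2018a) and is not restated here.

```python
import math

def time_bucket(now_ts, behavior_ts, num_buckets=16):
    """Map elapsed seconds to a bucket id on a log2 scale (assumed scheme)."""
    elapsed = max(now_ts - behavior_ts, 0)      # guard against clock skew
    bucket = int(math.log2(elapsed + 1))        # 0s -> 0, 1s -> 1, 3..6s -> 2, ...
    return min(bucket, num_buckets - 1)         # clip very old behaviors into the last bucket
```

Each returned id would then index a row of a learned embedding table, so that behaviors of similar recency share one representation.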
Reverse Positional Embedding. We also find that positional embedding has an extra positive impact even when time bucketized embedding is used. There are basically two ways to add it, and, surprisingly, we find that reverse positional embedding improves performance while forward positional embedding does not work well. We suspect that time bucketized embedding has already captured recency, and that the reverse positional embedding behaves like the reverse link in a bi-LSTM.
Feature Embedding. We encode categorical and continuous features in a way similar to Covington et al. (2016). We concatenate these feature representations and multiply by a projection matrix.
The final item embedding is the sum of these four parts. The label-side item encoder sets both the time bucket embedding and the reverse positional embedding to zero vectors.
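The composition of the four parts can be sketched as follows. This is a minimal illustration with randomly initialized tables; the table sizes, the names `id_table`, `time_table`, `pos_table`, `feat_proj`, and the convention that the most recent item has reverse position 0 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
id_table = rng.normal(size=(100, DIM))     # basic item-id embeddings
time_table = rng.normal(size=(16, DIM))    # one vector per time bucket
pos_table = rng.normal(size=(50, DIM))     # indexed by reverse position
feat_proj = rng.normal(size=(12, DIM))     # projection of the concatenated features

def encode_item(item_id, feats, time_bucket=None, reverse_pos=None):
    """Sum the parts; the label side passes None for time and position,
    which is equivalent to adding zero vectors for those two parts."""
    out = id_table[item_id] + feats @ feat_proj
    if time_bucket is not None:
        out = out + time_table[time_bucket]
    if reverse_pos is not None:
        out = out + pos_table[reverse_pos]
    return out

feats = rng.normal(size=12)
query_side = encode_item(3, feats, time_bucket=2, reverse_pos=0)
label_side = encode_item(3, feats)
```

Because both sides share the id and feature parts, the query-side and label-side embeddings of the same item differ only by the time and position terms.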
Multi-Head Self-Attention. Multi-head self-attention Vaswani et al. (2017) has achieved great success in various NLP tasks and has been adapted to sequential recommendation models Zhou et al. (2018a). It is more engineering-efficient than an RNN and has superior power to capture the multiple semantics of the behaviors, with a certain degree of interpretability. Here we implement self-attention in a much simpler form, as in Lin et al. (2017),
where the input is the embedding sequence and ffn is a two-layer feed-forward neural network; its output is the attention score, and the softmax of that score gives the item weight distribution in each head.
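A minimal sketch of this simplified self-attention, in the style of Lin et al. (2017), is given below. The tanh activation, the absence of bias terms, and the names `W1` and `W2` are assumptions for illustration rather than a description of our exact layer.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_pool(X, W1, W2):
    """X: (n, d) item embeddings -> (h, d) head representations."""
    scores = np.tanh(X @ W1) @ W2             # (n, h): two-layer ffn, one score per head
    weights = softmax(scores, axis=0)         # per head, weights over the n items sum to 1
    return weights.T @ X, weights             # heads: (h, d), weights: (n, h)
```

Each head is a weighted average of the item embeddings, so the weights can be inspected directly to see which behaviors a head attends to.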
Weighted Heads Aggregation (WHA). After multi-head self-attention, there are head representations . Compared to the common and straightforward approach that concatenates or averages all the heads and feeds them into a feed-forward layer, WHA first calculates a global feature vector from the heads, based on which another weighted attention is conducted to produce the final user representation as follows,
We find WHA to be more effective than plain averaging or concatenation. WHA is sensitive to the similarity metric used. Note that is necessary to improve the convergence, as well as to help improve the diversity in our neural network. Wang et al. (2018a); Wojke and Bewley (2018) give analyses of cosine-based softmax, and similar benefits of using cosine-based softmax are also found in Ma et al. (2019). According to Wojke and Bewley (2018), if follows a Gaussian distribution, cosine softmax will force the conditional user embedding to be more compact, making KNN search easier. Embedding initialization is better set to Glorot normal Glorot and Bengio (2010) instead of any uniform distribution, so that the initial embeddings are spread over the high-dimensional sphere, which avoids loss thrashing during training.
In our live experiment, we also find WHA to be critical in this encoder architecture for achieving a diversified recommendation list that reflects the different intents of users in their historical sequences. Without WHA, we observe that the retrieved item set is easily dominated by one intention, showing limited diversity.
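The idea of aggregating heads through a second attention step can be sketched as follows. This is only an illustration under stated assumptions: the global feature vector is taken to be the mean of the heads, and the head weights come from a temperature-scaled cosine similarity; the exact form used in our encoder is given by the WHA equation and may differ.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def weighted_head_aggregation(heads, temperature=0.1):
    """heads: (h, d). Returns a (d,) user vector and the (h,) head weights."""
    global_vec = heads.mean(axis=0)                        # assumed global feature vector
    sims = np.array([cosine(global_vec, h) for h in heads]) / temperature
    sims -= sims.max()                                     # numerical stability
    weights = np.exp(sims) / np.exp(sims).sum()            # softmax over heads
    return weights @ heads, weights
```

A smaller temperature sharpens the head weights, while a larger one moves the result toward a plain average of the heads.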
Appendix C Details in Experiments
C.1 Offline Dataset Description
Dataset statistics can be found in Table A1.
C.2 Training Environment
Training for large-scale production purposes uses a distributed TensorFlow cloud. Each worker is assigned a GTX 1080 Ti or Tesla P100 GPU. Each parameter server (ps) is allocated 32G of memory, while each worker has 16G.
C.3 Implementations of Loss and Sampling Alternatives
We list the detailed implementations of loss and sampling strategies in Table 3.
Negative Sampling Mikolov et al. (2013). We sample examples for each positive pair from a proposal distribution according to the degree as recommended in Mikolov et al. (2013). We implement the distributed sampling via alias method Stergiou et al. (2017) using AliGraph Zhu et al. (2019). We tune range from and report the best.
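For reference, the alias method can be sketched in a single process as below. This is a minimal, non-distributed illustration of Vose's variant; the toy `degrees` list and the 3/4 power on the degree (the exponent recommended in Mikolov et al. (2013)) are assumptions for the example and do not reproduce the AliGraph implementation.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(n) setup for O(1) sampling."""
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    accept, alias = [1.0] * n, list(range(n))
    while small and large:
        s, l = small.pop(), large.pop()
        accept[s], alias[s] = scaled[s], l        # slot s keeps s w.p. scaled[s], else l
        scaled[l] -= 1.0 - scaled[s]              # l donated the remainder of slot s
        (small if scaled[l] < 1.0 else large).append(l)
    return accept, alias

def sample(accept, alias, rng=random):
    i = rng.randrange(len(accept))                # pick a slot uniformly
    return i if rng.random() < accept[i] else alias[i]

degrees = [100, 10, 1, 1]                         # toy item degrees (hypothetical)
weights = [d ** 0.75 for d in degrees]            # assumed 3/4-power proposal
total = sum(weights)
probs = [w / total for w in weights]
accept, alias = build_alias_table(probs)
negative = sample(accept, alias, random.Random(0))
```

After the linear-time table construction, every draw costs one uniform integer and one uniform float, which is what makes the method attractive for very large item sets.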
Shared Negative Sampling Ying et al. (2018). For a given batch of positive samples, Ying et al. (2018) adopts a shared negative sampling strategy that samples examples according to a proposal distribution, and regards each as a negative sample. Note that, the training loss follows cross entropy of binary classification.
Sampled softmax Jean et al. (2014). We use the sampled softmax implementation of TensorFlow, which keeps the correction term and samples negative samples per batch. This is also a shared-example implementation. We use the log-uniform sampler (LogUni) as our proposal distribution.
Intra-Batch Contrastive Loss. CLRec-intra as discussed in Section 3.2.1. The batch size is 2560.
Inter-Batch Contrastive Loss. CLRec-inter as discussed in Section 3.2.2. The batch size is 256 and the queue size is 2560.
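The interplay of the batch and the queue can be sketched as follows. This is a minimal illustration, assuming that the queue stores item embeddings from previous batches, that the current batch's items are also used as candidates, and that scores are plain dot products; the class and function names are hypothetical.

```python
import numpy as np
from collections import deque

class NegativeQueue:
    """FIFO buffer of item embeddings seen in previous batches."""
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)      # oldest entries drop off automatically

    def enqueue(self, batch_items):
        self.buffer.extend(batch_items)

    def as_array(self):
        return np.stack(list(self.buffer)) if self.buffer else None

def inter_batch_loss(users, items, queue):
    """Softmax cross entropy of each user's positive item against
    the in-batch items plus the queued items."""
    negatives = queue.as_array()
    candidates = items if negatives is None else np.vstack([items, negatives])
    logits = users @ candidates.T                             # (B, B + Q)
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_probs[:, :len(items)]))       # positives sit on the diagonal
    queue.enqueue(items)                                      # current items become future negatives
    return loss

rng = np.random.default_rng(0)
queue = NegativeQueue(max_size=2560)
for _ in range(12):                                           # 12 batches of 256 examples
    loss = inter_batch_loss(rng.normal(size=(256, 16)), rng.normal(size=(256, 16)), queue)
```

No sampler is involved: the negatives are simply whatever the training stream produced most recently, so the queue size controls how many negatives each positive is contrasted with.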
C.4 Self-supervised user-user training via momentum contrast
With the help of MOCO, we also investigate a user-user contrastive learning paradigm under the CLRec framework, where and both indicate a user. In this setting, encoding queries and keys are both expensive, since both are sequences. MOCO is therefore a practical training strategy for saving memory and computational power. To be specific, we train the model by adding another user-to-user prediction loss, which is optimized along with the original user-to-item prediction loss. The query in both kinds of losses is a user’s click sequence. However, the key in the user-to-user prediction loss is a future sequence of five items, clicked by the same user, that occurs after the query sequence. The future sequence is encoded using a sequence encoder that has the same shape as the sequence encoder for encoding queries. In contrast, the key in the original user-to-item setting is simply the next item being clicked, and involves just a look-up of the embedding table. In our implementation, we only buffer all the attention score in WHA layer at Equation 15 in whose previous layers are updated following the Equation 5. However, the final layer of item embedding table in is directly updated via back-propagation, as a tradeoff for faster convergence due to the large scale of our embedding table and the sparsity of our training data.
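Equation 5 is not restated in this appendix. Assuming it is the momentum update of MOCO (He et al., 2019), in which the key encoder tracks an exponential moving average of the query encoder, the update can be sketched as below. The dictionary-of-arrays parameter layout and the default coefficient are illustrative assumptions.

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """In-place moving average: theta_k <- m * theta_k + (1 - m) * theta_q."""
    for name, q in query_params.items():
        key_params[name] *= m
        key_params[name] += (1.0 - m) * q

key_params = {"w": np.zeros(3)}
query_params = {"w": np.ones(3)}
momentum_update(key_params, query_params, m=0.9)   # key_params["w"] moves 10% toward the query
```

Parameters updated this way receive no gradient; only the layers excluded from the momentum update, such as the item embedding table mentioned above, are trained directly by back-propagation.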