Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems

05/20/2020 ∙ by Chang Zhou, et al.

Deep candidate generation (DCG) that narrows down the collection of relevant items from billions to hundreds via representation learning is essential to large-scale recommender systems. Standard approaches approximate maximum likelihood estimation (MLE) through sampling for better scalability and address the problem of DCG in a way similar to language modeling. However, live recommender systems face severe unfairness of exposure with a vocabulary several orders of magnitude larger than that of natural language, implying that (1) MLE will preserve and even exacerbate the exposure bias in the long run in order to faithfully fit the observed samples, and (2) suboptimal sampling and inadequate use of item features can lead to inferior representations for the unfairly ignored items. In this paper, we introduce CLRec, a Contrastive Learning paradigm that has been successfully deployed in a real-world massive recommender system, to alleviate exposure bias in DCG. We theoretically prove that a popular choice of contrastive loss is equivalent to reducing the exposure bias via inverse propensity scoring, which provides a new perspective on the effectiveness of contrastive learning. We further employ a fixed-size queue to store the items' representations computed in previously processed batches, and use the queue to serve as an effective sampler of negative examples. This queue-based design provides great efficiency in incorporating rich features of the thousands of negative items per batch thanks to computation reuse. Extensive offline analyses and four-month online A/B tests in Mobile Taobao demonstrate substantial improvement, including a dramatic reduction in the Matthew effect.




1 Introduction

Large-scale industrial recommender systems usually adopt a multi-stage pipeline, where the first stage, namely candidate generation, is responsible for retrieving a few hundred relevant entities from a billion-scale corpus. Deep candidate generation (DCG) (Covington et al., 2016), a paradigm that learns entity vector representations to enable fast k-nearest-neighbor retrieval (Johnson et al., 2017), has become an essential part of many live industrial systems with its enhanced expressiveness and improved flexibility compared to traditional collaborative-filtering based frameworks.

Typical large-scale DCG models (Covington et al., 2016; Grbovic and Cheng, 2018; Zhu et al., 2018; Li et al., 2019) regard the problem of identifying the items most relevant to a user as estimating a multinomial distribution over all items, conditioned on the user's past behaviors. Maximum likelihood estimation (MLE) is the conventional principle for training such models. However, exact computation of the log likelihood, which requires computing a softmax over a million- or even billion-scale collection of items, is computationally infeasible. Sampling is thus crucial to DCG. Among the various sampling strategies, sampled softmax (Bengio and Senécal, 2008; Jean et al., 2014) usually outperforms the binary-cross-entropy based approximations such as NCE (Gutmann and Hyvärinen, 2010) and negative sampling (Mikolov et al., 2013).

However, the MLE paradigm and the sampling strategies mainly stem from the language modeling community, where the primary goal is to faithfully fit the observed texts. Live recommender systems differ from natural language texts in several aspects. Firstly, the training data are collected from the currently running systems, which might be sub-optimal and biased towards popular items. In such situations, some high-quality items can be under-explored in the training data, and an algorithm trained via MLE will continue to under-estimate the relevance of the under-explored items in order to faithfully fit the observed data. Secondly, the set of items for recommendation is much larger than the vocabulary of natural languages, e.g., around 100 million items in our system, several orders of magnitude more than the words in a natural-language vocabulary. Many items may never be sampled in an epoch.

In this paper, we introduce CLRec, a practical Contrastive Learning framework for debiased DCG in RECommender systems. We establish the theoretical connection between contrastive learning and inverse propensity weighting, where the latter is a well-studied technique for bias reduction (Rosenbaum and Rubin, 1983; Imbens and Rubin, 2015; Thompson, 2012). Our theoretical result complements the previous studies of contrastive learning (Oord et al., 2018). We then present an easy-to-implement, efficient framework to practically reduce the exposure bias of a large-scale system based on the theoretical result. In particular, our implementation maintains a fixed-size first-in first-out (FIFO) queue (He et al., 2019) to accumulate positive samples and their representations from the previous batches, and uses the content of the queue as the negative samples to be discriminated from the positive samples of the next batch. It guarantees that every item will be sampled at some point in an epoch to serve as a negative example. More importantly, it allows us to reuse the computed results from the previous batches, e.g., saving 90% of the computation cost when the queue size is ten times the batch size. As a result, we can afford to encode complex features of the negative samples even when the negative sample size, i.e., the queue size, is very large, where the complex features may improve the quality of the learned representations of the under-explored items.

CLRec has been fully deployed in our live system as the default choice, serving billions of page views each day since February 2020. We observe that CLRec is capable of recommending high-quality items that are largely neglected by most currently running systems, and it consistently outperforms the previous state-of-the-art baselines regarding the online recommendation performance.

2 The Proposed Framework

In this section, we prove the connection between contrastive learning and inverse propensity weighting (IPW), based on which we propose a practical framework for bias reduction in large-scale DCG.

2.1 Problem Formulation


Given a dataset of user clicks $\mathcal{D} = \{(x_{u,t}, y_{u,t})\}_{u,t}$, where $x_{u,t}$ represents user $u$'s clicks prior to the $t$-th click $y_{u,t}$, and $T_u$ denotes the number of clicks from the user $u$. We will drop the subscripts occasionally and write $(x, y)$ in place of $(x_{u,t}, y_{u,t})$ for conciseness. We use $\mathcal{X}$ to refer to the set of all possible click sequences, i.e. $x \in \mathcal{X}$. Each $y \in \mathcal{I}$ represents a clicked item, which includes various types of features associated with the item, while $\mathcal{I}$ is the set of all possible items. The features of $y$ could be in any form, e.g., the item's unique identifier, embeddings, or raw data of its image and text description. The number of items $|\mathcal{I}|$ easily reaches 100 million in large live systems.

Deep Candidate Generation

The deep candidate generation paradigm involves learning a user behavior encoder $f_\theta(x)$ and an item encoder $g_\theta(y)$. The set of parameters used by each encoder is a subset of $\theta$, i.e. the set of all trainable parameters in the system. It then takes $\{g_\theta(y) : y \in \mathcal{I}\}$ and builds a k-nearest-neighbor search service, e.g., Faiss (Johnson et al., 2017). As a result, given an arbitrary user behavior sequence $x$ at serving time, we can instantly retrieve the top items relevant to the user by finding the candidates $g_\theta(y)$ most similar to $f_\theta(x)$. Most implementations use the inner product $\phi_\theta(x, y) = f_\theta(x)^\top g_\theta(y)$ or cosine similarity as the similarity score. The typical learning procedure fits the data following the maximum likelihood estimation (MLE) principle:

$$\min_\theta \; -\sum_{(x,y)\in\mathcal{D}} \log p_\theta(y \mid x), \qquad p_\theta(y \mid x) = \frac{\exp(\phi_\theta(x, y))}{\sum_{y' \in \mathcal{I}} \exp(\phi_\theta(x, y'))}. \quad (1)$$

The denominator of Eq. (1) sums over all possible items, which is infeasible in practice and thus requires approximation, e.g., via sampling. However, the observed clicks for training are from the previous version of the recommender system. The training data thus suffer from exposure bias (i.e. missing not at random (Schnabel et al., 2016)) and reflect the users' preference regarding the recommended items rather than all potential items. High-quality items that have few clicks in the training data will likely remain under-recommended by a new algorithm trained via the MLE paradigm.
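As an illustration, the serving-time retrieval step can be sketched with plain NumPy standing in for a production ANN index such as Faiss (the function and variable names below are ours, for illustration only):

```python
import numpy as np

def retrieve_top_k(user_vec, item_matrix, k):
    """Inner-product top-k retrieval, a NumPy stand-in for an ANN index
    such as Faiss.

    user_vec:    (d,)   output of the user behavior encoder f_theta(x)
    item_matrix: (n, d) stacked item encoder outputs g_theta(y)
    Returns indices of the k items with the highest inner product, best first.
    """
    scores = item_matrix @ user_vec              # (n,) similarity scores
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]         # order the k winners
```

In production, exact brute-force scoring over 100 million items is replaced by an approximate index, but the interface is the same: encode once, then search by inner product.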

2.2 Understanding Contrastive Learning from a Bias-Reduction Perspective

We now introduce the family of contrastive losses that we are interested in, and reveal their connection with inverse propensity weighting (IPW) (Rosenbaum and Rubin, 1983; Imbens and Rubin, 2015; Thompson, 2012), a well-studied technique for bias reduction.

Sampled Softmax

The kind of contrastive loss we will investigate is strikingly similar to sampled softmax. We thus recap sampled softmax here and will show later that the minor difference is crucial. There are many variants of sampled softmax (Bengio and Senécal, 2008; Jean et al., 2014), among which the following variant is integrated in TensorFlow (Abadi et al., 2016) and popular for industrial systems:

$$\mathcal{L} = -\sum_{(x,y)\in\mathcal{D}} \log \frac{\exp\!\left(\phi_\theta(x, y) - \log q(y \mid x)\right)}{\sum_{y' \in \{y\} \cup S} \exp\!\left(\phi_\theta(x, y') - \log q(y' \mid x)\right)}, \quad (2)$$

where $S = \{y_1, \ldots, y_N\}$ are negative samples drawn from a pre-defined proposal distribution $q(y \mid x)$. Subtracting $\log q(y' \mid x)$ is necessary for it to converge to the same solution as the exact loss in Eq. (1). Most implementations assume $q(y \mid x) = q(y)$ and set $q(y)$ somehow proportional to the popularity of the items to improve convergence. In practice, we would draw thousands of negative samples to pair with each positive example. Sampled softmax in general outperforms other approximations such as NCE (Gutmann and Hyvärinen, 2010) and negative sampling (Mikolov et al., 2013) when the vocabulary is large (McCallum et al., 1998; Liang et al., 2018; Covington et al., 2016; Jean et al., 2014).
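To make the role of the $-\log q$ correction concrete, here is a minimal NumPy sketch (the helper names are ours) of the corrected sampled-softmax loss next to the uncorrected variant discussed next:

```python
import numpy as np

def log_softmax(v):
    v = v - v.max()                     # shift for numerical stability
    return v - np.log(np.exp(v).sum())

def sampled_softmax_loss(phi, q, pos=0):
    """Eq. (2)-style loss: each logit is corrected by -log q so that the
    sampled objective stays consistent with the full-softmax MLE loss.

    phi: (1+N,) similarity scores phi(x, y'); phi[pos] is the positive.
    q:   (1+N,) proposal probabilities q(y'|x) of the same examples.
    """
    return -log_softmax(phi - np.log(q))[pos]

def contrastive_loss(phi, pos=0):
    """Eq. (3)-style loss: identical form but WITHOUT the -log q correction;
    its optimum therefore fits p_data/q instead of p_data."""
    return -log_softmax(phi)[pos]
```

With a uniform proposal the two losses coincide (the correction becomes a constant shift that the softmax ignores); with a popularity-skewed proposal they differ, and that difference is exactly what Section 2.2 analyzes.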

Contrastive Loss

We study the following type of contrastive loss (Sohn, 2016; Oord et al., 2018):

$$\mathcal{L} = -\sum_{(x,y)\in\mathcal{D}} \log \frac{\exp\!\left(\phi_\theta(x, y)\right)}{\sum_{y' \in \{y\} \cup S} \exp\!\left(\phi_\theta(x, y')\right)}, \quad (3)$$

where $S = \{y_1, \ldots, y_N\}$ are again sampled from $q(y \mid x)$ for each $(x, y)$ to pair with the positive sample. It no longer optimizes the MLE loss in Eq. (1), because it misses the correction term $-\log q(y' \mid x)$ and thus does not correct the bias introduced by sampling. Many efforts on contrastive learning have focused on designing a well-performing proposal distribution $q$ (Sohn, 2016; Caselles-Dupré et al., 2018). InfoNCE (Oord et al., 2018) demonstrates that this loss maximizes a lower bound of the mutual information between $x$ and $y$ if we set the proposal distribution to be the actual data distribution $p_{\text{data}}(y)$, i.e. if we sample each $y'$ proportional to its frequency in the dataset.

Contrastive Learning and Exposure Bias Reduction

The contrastive loss shown above in Eq. (3) has recently achieved remarkable success in various fields, e.g., visual representation learning (He et al., 2019). Nevertheless, it remains a question why the loss is effective. We will reveal that the contrastive loss is a sampling-based approximation of an inverse propensity weighted (IPW) loss.¹ The IPW loss, whose derivation can be found in Section A of the supplemental material, is:

$$\mathcal{L}_{\text{IPW}} = -\sum_{(x,y)\in\mathcal{D}} \frac{1}{q(y \mid x)} \log p_\theta(y \mid x), \quad (4)$$

where $q(y \mid x)$ should be the propensity score function, which represents the probability that item $y$ is recommended to user $x$ when we were collecting the training data $\mathcal{D}$. The idea of IPW is to model missing-not-at-random via the propensity scores in order to correct the exposure bias. We provide a proof in Section A of the supplemental material that the IPW loss optimizes $p_\theta$ to capture the oracle user preference even when there exists exposure bias. A standard implementation of the IPW loss has two steps: the first step uses a separate model to serve as $q(y \mid x)$ and optimizes it by fitting the exposure history according to the MLE principle, while the second step optimizes $p_\theta$ according to Eq. (4). However, the two-stage pipeline of IPW, as well as the numerical instability brought by dividing by $q(y \mid x)$, makes IPW impractical for large-scale production systems.

¹ Note that our IPW loss is different from the previous works on debiased recommenders (Schnabel et al., 2016; Wang et al., 2016). We focus on multinomial propensities, i.e. whether an item is selected and recommended by a recommender out of all possible items. The previous works consider Bernoulli propensities related to users' attention, i.e. whether a user notices a recommended item or not, and mostly deal with the position bias in ranking.
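As a toy illustration of why dividing by the propensity scores debiases the objective (the numbers below are ours, not the paper's), the IPW loss over a four-item catalogue is minimized by the distribution proportional to $p_{\text{data}}/q$, not by the biased $p_{\text{data}}$ itself:

```python
import numpy as np

# Toy numbers (ours, for illustration): click frequencies p_data are biased
# toward items the old system exposed more often, with propensities q.
p_data = np.array([0.50, 0.30, 0.15, 0.05])   # observed (biased) click dist.
q      = np.array([0.40, 0.30, 0.20, 0.10])   # propensity scores q(y|x)

def ipw_loss(p_model):
    # Eq. (4) for a single context: each observed click reweighted by 1/q.
    return -(p_data / q) @ np.log(p_model)

# The minimizer over the probability simplex is proportional to p_data / q:
# the preference distribution with the exposure bias divided out.
p_star = (p_data / q) / (p_data / q).sum()
```

Here `ipw_loss(p_star)` is strictly below `ipw_loss(p_data)`: under the debiased objective, faithfully reproducing the biased data is suboptimal.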

Fortunately, we can prove that the contrastive loss in Eq. (3) is in principle optimizing the same loss as Eq. (4), and in Subsection 2.3 we provide a simple implementation that does not require two separate training stages and avoids the instability caused by dividing by the propensity scores. Our key theoretical result is as follows:

Theorem 1.

The optimal solutions of the contrastive loss (Eq. 3) and the IPW loss (Eq. 4) both minimize the KL divergence from $\tilde{p}(y \mid x) \propto p_{\text{data}}(y \mid x)\,/\,q(y \mid x)$ to $p_\theta(y \mid x)$. Here $p_{\text{data}}(y \mid x)$ is the data distribution, i.e. the frequency of $y$ appearing in $\mathcal{D}$ given context $x$.


We now give a proof sketch of the theorem. We will focus on one training instance, i.e. one sequence $x$. The IPW loss (Eq. 4) for training sample $x$ is $-\sum_{y} p_{\text{data}}(y \mid x)\, \frac{1}{q(y \mid x)} \log p_\theta(y \mid x)$, which equals, up to a multiplicative constant, the cross entropy from $\tilde{p}(y \mid x) \propto p_{\text{data}}(y \mid x)/q(y \mid x)$ to $p_\theta(y \mid x)$. The IPW loss is thus minimizing the Kullback–Leibler (KL) divergence from $\tilde{p}(y \mid x)$ to $p_\theta(y \mid x)$.

Let us now focus on the contrastive loss for the training sample $x$. Let $S = \{y_1, \ldots, y_{N+1}\}$, where $y_1$ is the positive example drawn from $p_{\text{data}}(y \mid x)$ and $y_2, \ldots, y_{N+1}$ are the negative samples drawn from $q(y \mid x)$. Note that $S$ is a multi-set where we allow the same item to appear multiple times. The contrastive loss (Eq. 3) for $x$ equals $\mathbb{E}_{y, S}\big[-\log \frac{\exp(\phi_\theta(x, y))}{\sum_{y' \in S} \exp(\phi_\theta(x, y'))}\big]$, where the probability of observing $S$ given that the positive example is $y$ satisfies $p(S \mid y, x) > 0$ if $y \in S$, or $p(S \mid y, x) = 0$ if $y \notin S$, since by definition $S$ must include $y$ if we know that the positive example is $y$.

Let $p(y \mid S, x)$ denote the posterior probability that $y$ is the positive example given the multi-set $S$. By Bayes' rule, $p(y \mid S, x) \propto p_{\text{data}}(y \mid x)/q(y \mid x)$ if $S$ includes $y$. As a result, we can see that the contrastive loss for training sample $x$ is proportional to $\sum_{S} p(S \mid x) \sum_{y \in S} p(y \mid S, x)\, \big[-\log q_\theta(y \mid S, x)\big]$, which is equal to $\sum_{S} p(S \mid x)\, \mathrm{KL}\big(p(\cdot \mid S, x) \,\|\, q_\theta(\cdot \mid S, x)\big)$ plus a constant independent of $\theta$. Here we use $p(\cdot \mid S, x)$ and $q_\theta(\cdot \mid S, x) \propto \exp(\phi_\theta(x, \cdot))$ to refer to the probability distributions whose supports are $S$. Since we are minimizing the KL divergence under all possible $S$, the global optima will be the ones that make $q_\theta(y \mid S, x)$ equal to $p(y \mid S, x)$ for all $S$, i.e. $p_\theta(y \mid x) \propto p_{\text{data}}(y \mid x)/q(y \mid x)$, if $\phi_\theta$ is expressive enough to fit the target distribution arbitrarily closely. Note that $\phi_\theta$ is indeed expressive enough since we implement it as a neural network, due to the universal approximation theorem (Cybenko, 1989; Hornik, 1991). The two losses hence have the same global optima. ∎
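Theorem 1 can also be checked numerically in a toy setting (our construction, with one negative sample per positive): minimizing the expected contrastive loss by gradient descent recovers a softmax proportional to $p_{\text{data}}/q$, not $p_{\text{data}}$:

```python
import numpy as np

# Toy check of Theorem 1 (our construction). With one negative per positive,
# the expected contrastive loss over a 3-item catalogue is
#   L(s) = sum_{y,y'} p_data(y) q(y') * -log( e^{s_y} / (e^{s_y} + e^{s_y'}) ),
# and its minimizer satisfies e^{s_y} proportional to p_data(y) / q(y).
p_data = np.array([0.6, 0.3, 0.1])   # biased data distribution p_data(y|x)
q      = np.array([0.2, 0.3, 0.5])   # proposal / propensity q(y|x)

s = np.zeros(3)                      # logits phi(x, y) for a fixed context x
for _ in range(5000):                # plain gradient descent on L(s)
    grad = np.zeros(3)
    for y in range(3):               # positive example drawn from p_data
        for yn in range(3):          # negative example drawn from q
            w = p_data[y] * q[yn]
            p_pos = 1.0 / (1.0 + np.exp(s[yn] - s[y]))  # softmax over {y, yn}
            grad[y]  += w * (p_pos - 1.0)
            grad[yn] += w * (1.0 - p_pos)
    s -= 0.5 * grad

learned = np.exp(s) / np.exp(s).sum()
target = (p_data / q) / (p_data / q).sum()   # normalized p_data / q
```

`learned` converges to `target` rather than to `p_data`: the contrastive optimum divides out the proposal distribution, which is exactly the IPW correction.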

Remark 1.

The implication of Theorem 1 is that the contrastive loss (Eq. 3) can approximately reduce the exposure bias if we set the proposal distribution $q(y \mid x)$ to be the propensity score, i.e. the probability that the old systems delivered item $y$ to user $x$ when we were collecting the training data $\mathcal{D}$.

2.3 Practical Implementations of Contrastive Learning for Large-Scale DCG

Figure 1: Three implementations of CLRec, whose implicit proposal distributions are (approximately) $p_{\text{data}}(y)$. The superscripts $+$ and $-$ mean positive and negative examples respectively. Implementation (a) uses the positive examples of the other instances in the present batch as the negative examples. Implementation (b) creates a fixed-size FIFO queue to store the positive examples encountered in previously processed batches, and uses the examples stored in the queue as the negative examples for the present batch. Implementation (c) differs from implementation (b) in that the queue caches the computed representations $g_\theta(y)$ rather than the raw features of $y$.

We now describe our implementations of the debiased contrastive loss for large-scale DCG.

We first note that the propensity score $q(y \mid x)$ is not easy to estimate in practice, because industrial recommender systems involve many complicated stages and the data are also highly sparse. Moreover, some theoretical results (Schnabel et al., 2016) have pointed out that small propensities can lead to high variance that harms overall performance, and thus accurate propensity scores may not perform better than smoothed, inaccurate propensities. We thus use $q(y)$ in place of $q(y \mid x)$, i.e. assuming $q(y \mid x) \approx q(y)$, to ease estimation and avoid small propensities. Secondly, $q(y)$ (i.e. the probability that item $y$ is recommended to some user) has an extremely high correlation with $p_{\text{data}}(y)$, i.e. the probability that item $y$ is being recommended and clicked by someone, because the existing system will mainly recommend items that have a high click-through rate if it is already highly optimized. We thus further replace $q(y)$ with $p_{\text{data}}(y)$, i.e. assuming $q(y) \approx p_{\text{data}}(y)$, to ease implementation. In summary, we assume $q(y \mid x) \approx p_{\text{data}}(y)$, which allows us to draw negative samples from $p_{\text{data}}(y)$ directly when implementing the contrastive loss for bias reduction. We thus, unlike the IPW methods, do not need to introduce an extra stage into the training pipeline.
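Under the assumption $q(y \mid x) \approx p_{\text{data}}(y)$, explicit negative sampling would amount to drawing items in proportion to their empirical click counts; a minimal sketch (toy counts, illustrative only):

```python
import numpy as np

# Toy click counts (ours, illustrative): with q(y|x) ~ p_data(y), explicit
# negative sampling reduces to drawing items by empirical click frequency.
click_counts = np.array([50, 30, 15, 5])       # clicks per item in the log
p_data = click_counts / click_counts.sum()     # empirical item frequency

rng = np.random.default_rng(0)
negatives = rng.choice(len(p_data), size=1000, p=p_data)  # negative item ids
```

The queue-based design described next achieves the same effect implicitly: positives enter the queue in proportion to their click frequency, so the queue contents are themselves approximate samples from $p_{\text{data}}(y)$, with no sampler and no extra communication.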

However, sampling will still incur non-negligible overheads, e.g. communication costs, in a distributed environment (Stergiou et al., 2017), and cannot guarantee that every item will be sampled in an epoch. We thus adopt a queue-based design (He et al., 2019) that avoids explicitly performing sampling, as shown in Figure 1b and Figure 1c. To be specific, we maintain a first-in first-out (FIFO) queue with a fixed capacity of $K$ examples. Given a new batch to process, we first enqueue the positive examples (or their representations $g_\theta(y)$) encountered in the present batch. We then use the examples stored in the queue (including, for each instance, its one positive example and $K-1$ negative examples) to compute the denominator of the contrastive loss (Eq. 3) for the present batch. In a distributed setting, each worker maintains its own queue locally to avoid communication costs. When the queue size is equal to the batch size, our implementation is equivalent to sampling negative examples from the present batch (Sohn, 2016; Hidasi et al., 2015; Chen et al., 2017) (see Figure 1a). In general, we need thousands of negative samples to achieve satisfying performance. We therefore use a large queue size, but with a small batch size to prevent running out of memory, e.g. a batch size in the hundreds paired with a queue size in the thousands.
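The queue mechanics can be sketched in NumPy as follows (class and variable names are ours; a production version would compute this loss inside an autodiff framework):

```python
import numpy as np
from collections import deque

class QueueNegatives:
    """Sketch of the queue-based loss (Figure 1b): each batch's positives are
    enqueued into a fixed-capacity FIFO queue, and the whole queue then serves
    as the candidate set, so no explicit negative sampling is performed."""

    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)     # oldest entries evicted first

    def batch_loss(self, user_vecs, item_vecs):
        """user_vecs: (B, d) outputs f(x); item_vecs: (B, d) outputs g(y)."""
        for v in item_vecs:                     # enqueue current positives
            self.queue.append(v)
        cand = np.stack(list(self.queue))       # (K, d) queue = candidates
        logits = user_vecs @ cand.T             # (B, K) similarity scores
        b = len(item_vecs)
        pos = len(self.queue) - b + np.arange(b)   # positives: newest B slots
        m = logits.max(axis=1, keepdims=True)      # stable log-sum-exp
        logz = (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()
        return float(np.mean(logz - logits[np.arange(b), pos]))
```

For the cached variant of Figure 1c, gradients would simply be stopped through the queued entries carried over from earlier batches.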

With the implementation that caches the representations $g_\theta(y)$ (see Figure 1c), we can no longer back-propagate through the negative examples from the previous batches, though we can still back-propagate through the negative examples from the present batch. As a result, we find that the total number of training steps required for convergence mildly increases. However, since each training step takes much less time to compute (and incurs less communication cost in a distributed training setting), the total running time can still be greatly reduced if the features of the negative items are expensive to encode, e.g. if the features contain raw images, texts, or even structured data such as a knowledge graph.

Table 1: Aggregate diversity (Adomavicius and Kwon, 2011), i.e. the number of distinct items recommended to a random subset of users.

| Method | Aggregate Diversity |
| --- | --- |
| sampled-softmax | 10,780,111 |
| CLRec | 21,905,318 |

Table 2: CLRec vs. the sampling-based alternatives. We conducted these proof-of-concept live experiments in a small-traffic scenario, due to the costs of online experiments. The negative sampling baseline had been outdated and removed from our live system before this work started.

| Method | HR@50 | CTR (online) |
| --- | --- | --- |
| negative sampling | 7.1% | outdated |
| shared negative sampling | 6.4% | - |
| sampled-softmax | 17.6% | 3.32% |
| CLRec | 17.8% | 3.85% |

Figure 2: The total number of impressions of the items in a specific degree bucket vs. the logarithm of the corresponding degree. The rightmost bar is not the highest because the number of extremely popular items is small, even though each item in the bucket has an extremely high degree.
Table 3: Main live experiment results conducted in one of the largest scenarios on our platform. CLRec consistently outperforms the baseline for months and has been fully deployed since Feb 2020.

| Method | CTR | Average Dwell Time | Popularity Index of Recommended Items |
| --- | --- | --- | --- |
| MIND | 5.87% | - | 0.658 (high: tends to recommend popular items) |
| CLRec | 6.30% | +11.9% | 0.224 (low: fair to under-explored items) |
Table 4: Efficiency. We report the training speed in terms of the number of positive examples processed per second, and the average network traffic of the workers in a distributed environment.

| Method | Examples Per Second | Network Traffic (MB/s) |
| --- | --- | --- |
| sampled-softmax + negatives w/o features | | |
| sampled-softmax + negatives with features | | |
| CLRec + negatives w/o features | | |
| CLRec + negatives with features | | |
Table 5: Task u2i is the regular task where $x$ is a sequence of clicks and $y$ is the next click to be predicted. Task u2u adds an auxiliary loss where $x$ and $y$ are both sequences from the same user (before and after a sampled timestamp), which is co-trained with task u2i. Task u2u is a complex pretext task that requires the cached implementation due to the high costs of encoding the negative samples. HR1@50 and HR5@50 represent HR@50 for predicting the next one and five clicks respectively.

| Task & Implementation | HR1@50 | HR5@50 |
| --- | --- | --- |
| CLRec-u2i | 17.9% | 12.1% |
| CLRec-u2u, cached | 18.3% | 12.7% |
| CLRec-u2u, cached + MoCo | 18.2% | 12.6% |

Table 6: The benefits of encoding features for the negative samples. Most baselines that employ sampled-softmax do not encode rich features for the negative samples (though they still use features when encoding the users' click sequences), because the number of negative samples is large and brings high costs if the features are complex. Fortunately, CLRec's cached implementation greatly reduces the costs, as demonstrated in Table 4.

| Method | HR@50 |
| --- | --- |
| CLRec + negatives w/o features | 17.4% |
| CLRec + negatives with features | 19.4% |
Table 7: Results on public benchmarks to ensure reproducibility. For fair comparison, the CLRec implementation here uses the same Transformer (Lin et al., 2017) encoder as SASRec but with a contrastive loss.

| Method | Metric | ML-1M | Beauty | Steam | Metric | ML-1M | Beauty | Steam |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SASRec | HR@1 | 0.2351 | 0.0906 | 0.0885 | NDCG@5 | 0.3980 | 0.1436 | 0.1727 |
| BERT4Rec | | 0.2863 | 0.0953 | 0.0957 | | 0.4454 | 0.1599 | 0.1842 |
| CLRec | | 0.3013 | 0.1147 | 0.1325 | | 0.4616 | 0.1876 | 0.2396 |
| Improv. | | +5.2% | +20.4% | +38.4% | | +3.6% | +17.3% | +30.0% |
| SASRec | HR@5 | 0.5434 | 0.1934 | 0.2559 | NDCG@10 | 0.4368 | 0.1633 | 0.2147 |
| BERT4Rec | | 0.5876 | 0.2207 | 0.2710 | | 0.4818 | 0.1862 | 0.2261 |
| CLRec | | 0.6045 | 0.2552 | 0.3413 | | 0.4988 | 0.2156 | 0.2852 |
| Improv. | | +2.9% | +15.6% | +25.9% | | +3.5% | +15.8% | +26.1% |
| SASRec | HR@10 | 0.6629 | 0.2653 | 0.3783 | MRR@1 | 0.3790 | 0.1536 | 0.1874 |
| BERT4Rec | | 0.6970 | 0.3025 | 0.4013 | | 0.4254 | 0.1701 | 0.1949 |
| CLRec | | 0.7194 | 0.3423 | 0.4829 | | 0.4417 | 0.1968 | 0.2457 |
| Improv. | | +3.2% | +13.1% | +20.3% | | +3.8% | +15.7% | +26.0% |

3 Experiment

In this section, we report both offline and online results in a large-scale e-commerce recommender system, as well as offline results on public datasets to ensure reproducibility.

3.1 Online and Offline Experiments in Production Environments

The online experiments have lasted for four months, and our algorithm serves several scenarios with different traffic volumes. We leave some details to the supplemental material, e.g., hyper-parameters, evaluation metrics, and the features used in Section B, details of the encoders in Section C, and details of the baselines in Section D. The total number of items for recommendation is around 100 million. We use the queue-based implementation without caching in this section, and explore settings where encoding is much more expensive and requires caching in Subsection 3.2.


3.1.1 The Debiasing Effect in Large-Scale Production Environments

To examine the debiasing effects of CLRec, we first conduct offline experiments and compare CLRec with sampled softmax. We report the aggregate diversity (Adomavicius and Kwon, 2011) in Table 1, and the distributions of the recommended items resulting from the different losses in Figure 2.

Table 1 shows that CLRec more than doubles the aggregate diversity, an over 100% improvement. We see from Figure 2 that sampled softmax tends to faithfully fit the distribution of the training data. CLRec, however, learns a relatively different distribution, which shifts towards the under-explored items. These results suggest that CLRec does result in a fairer algorithm that alleviates the "rich-get-richer" phenomenon.

This debiasing effect not only leads to a fairer system, but also contributes to a significant improvement in online performance. In Table 2, we compare CLRec with other sampling alternatives using the same implementation of the encoders. Details of the alternative methods can be found in Section D of the supplemental material. We observe that negative sampling (Mikolov et al., 2013), including its variant (Ying et al., 2018) that makes the instances in a batch share the same large set of negative samples, does not perform well in our settings. CLRec's improvement over sampled-softmax (Jean et al., 2014; Bengio and Senécal, 2008) w.r.t. the offline metric HitRate@50 is negligible. However, CLRec achieves a significant improvement in the online click-through rate (CTR). This indicates that a discrepancy exists between the offline metric and the online performance.

3.1.2 Four Months’ Large-Scale A/B testing

CLRec has been fully deployed in several heavy-traffic scenarios since Feb 2020, after the initial proof-of-concept ablation studies shown in Table 2. Table 3 shows our main online results conducted in these heavy-traffic scenarios, with billions of page views each day. During the four months of A/B testing, CLRec has consistently outperformed MIND (Li et al., 2019), the previous state-of-the-art baseline, in terms of fairness metrics such as aggregate diversity and average popularity index, as well as user engagement metrics such as click-through rate and average dwell time (see Subsection B.2 of the supplemental material for the definitions of the fairness and engagement metrics).

Compared with MIND (Li et al., 2019), which is the previous state-of-the-art DCG baseline deployed in the system, CLRec tends to recommend items with a lower popularity index while being more attractive to the users. This demonstrates CLRec's ability to identify high-quality items that are rarely explored by the previous systems. We further achieve a substantial relative improvement in the total number of clicks on our platform after ensembling CLRec and MIND, compared to using MIND as the only DCG method.

3.2 Computational Feasibility of CLRec on Complex Pretext Tasks


Table 4 compares CLRec and sampled-softmax in terms of training speed and the network traffic required in a distributed setting. CLRec's queue-based implementation is much more efficient than the methods that perform explicit sampling, since CLRec reuses the result computed for a positive sample when that sample shortly afterwards serves as a negative sample. The version of sampled-softmax that encodes features for the negative items is from Zhu et al. (2019). This proof-of-concept experiment only uses categorical features of the items, and we expect the efficiency improvement to be even more substantial with more complex features. Table 6 shows that encoding features for the negative samples is beneficial, which justifies the effort spent on efficiency.

Complex Pretext Task that Requires the Cached Implementation

We now demonstrate that CLRec with a queue that caches the computed results can enable more complex pretext tasks that may improve the quality of the learned representations. To be more specific, we consider an auxiliary task where $x$ and $y$ are both sequences from the same user (before and after a sampled timestamp). The goal is to identify the correct sequence $y$ that belongs to the same user who produced $x$. More details can be found in Section E of the supplemental material. This auxiliary task was previously too expensive to implement with sampled-softmax, since the negative samples are sequences and are thus expensive to encode. Fortunately, cached CLRec can implement this task efficiently. Table 5 demonstrates that the auxiliary task can improve an algorithm's ability to make long-term predictions.

MoCo (short for momentum contrast) He et al. (2019) proposes a momentum method for updating the encoders based on the cached results and for stabilizing the training loss. We observe no gain with MoCo, possibly because (1) our model is shallow compared to those for visual tasks, and (2) we have a large embedding table which serves as a consistent dictionary that prevents the loss from oscillating.

3.3 Experiments on Public Dataset

To ensure reproducibility, we also conduct experiments on three public datasets used by the existing approaches (Kang and McAuley, 2018; Sun et al., 2019). Our source code will be released. We strictly follow the settings and metrics used by BERT4Rec (Sun et al., 2019) and report the results in Table 7. Note that the metrics used by BERT4Rec (Sun et al., 2019) penalize false positive predictions on popular negative items. As a result, CLRec achieves a significant performance gain thanks to bias reduction. Qualitative results that illustrate the debiasing effects, similar to those in Subsection 3.1.1, can be found in Section F of the supplementary material.

4 Related Work

Deep Candidate Generation

Deep candidate generation methods are widely deployed in industrial systems, e.g., YouTube (Covington et al., 2016; Joglekar et al., 2019; Chen et al., 2019), Taobao (Zhou et al., 2017; Li et al., 2019; Ma et al., 2019; Zhu et al., 2018), and Pinterest (Ying et al., 2018). The existing methods explicitly sample negative examples from a pre-defined proposal distribution (Mikolov et al., 2013; Jean et al., 2014; Bengio and Senécal, 2008). The proposal distribution not only affects convergence, but also has a significant impact on performance (Caselles-Dupré et al., 2018). Empirically, the number of negative samples needs to be large, e.g. a few thousand paired with each positive example. Consequently, it is computationally expensive to incorporate rich features for the negative samples. The existing systems hence usually choose not to encode features for the negative examples except for simple features such as item IDs (Covington et al., 2016), even though rich features for the negative samples have been demonstrated to be beneficial (Bai et al., 2019). CLRec achieves great efficiency when encoding rich features for the negative samples by caching the computed results.

Bias Reduction and Fairness in Recommender Systems

Recommendation algorithms that directly fit the training data suffer from selection bias due to the missing-not-at-random phenomenon (Schnabel et al., 2016), where the previous recommendation algorithms affect the training data being collected. The topic of reducing bias when training and evaluating recommender systems has been explored before (Steck, 2013; Ai et al., 2018; Wang et al., 2016; Schnabel et al., 2016; Chen et al., 2019; Yang et al., 2018). However, these existing works mostly focus on small-scale offline settings, and rely on techniques impractical for large-scale DCG. For example, most of them involve an extra stage to train a propensity score estimator. We also find that dividing by the propensity scores leads to numerical instability and thus fails to achieve satisfying results. Correcting the bias helps improve P-fairness (Burke, 2017), i.e. fairness towards the previously under-recommended products (Beutel et al., 2019).

Contrastive Learning

Contrastive learning, which aims to learn high-quality representations via self-supervised pretext tasks, has recently achieved remarkable successes in various domains, e.g., speech processing (Oord et al., 2018), computer vision (Hjelm et al., 2018; He et al., 2019), graph data (Veličković et al., 2018), and compositional environments (Kipf et al., 2019). The contrastive loss we investigate in this paper is a generalization of the InfoNCE loss (Oord et al., 2018). InfoNCE was previously understood as a bound on the mutual information between two variables (Oord et al., 2018). Our work provides a new perspective on the effectiveness of the contrastive loss, by illustrating its connection with inverse propensity weighting (Rosenbaum and Rubin, 1983; Imbens and Rubin, 2015; Thompson, 2012), a well-known technique for bias reduction.

5 Conclusion

We established in theory the connection between contrastive learning and bias reduction. We then proposed CLRec, a contrastive learning framework for debiased candidate generation, which may lead to a fairer system and can achieve high efficiency when encoding features of complex data types.

Broader Impact

Positive Impact

The debiasing effect of our proposed framework CLRec helps to address the producer fairness (i.e. P-fairness) problem (Burke, 2017) in recommender systems, so that high-quality items that were previously under-explored get more chances of being presented to the users. It leads to a fairer ecosystem. While this paper takes a live e-commerce recommender system as an example to illustrate the benefits of CLRec, we would like to highlight that CLRec can be applied to other domains such as search engines, advertising, the retrieval phase in open-domain question answering, as well as other types of recommender systems. Applying CLRec to these various domains may bring:

  • Fairer traffic assignment in user-generated content (UGC) platforms.

  • Increased diversity of the information being spread in a society by distributing the voices from the minority groups in a fairer manner.

  • Fairer opportunities in the job market.

Although there are other studies, e.g., IPW-based approaches, that also aim to reduce data bias, they usually suffer from implementation difficulties and numerical instability during optimization. Moreover, few target specifically the candidate generation stage of modern recommender systems. We also find little public information on how a debiasing method eventually affects a live system. This paper shares results from online experiments lasting at least four months and reports positive outcomes, which could be valuable to the community.

Negative Impact

A more accurate recommender system means that a user will more easily absorb, passively, the information that the system presents, and the user may become over-reliant on the system. Platforms that provide accurate recommendation services may thus have the power to control what their users see. This is a general problem for recommender systems. On the other hand, CLRec prefers under-explored items with high potential, and it is not yet clear whether this preference makes CLRec more prone to adversarial attacks.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §2.2.
  • [2] G. Adomavicius and Y. Kwon (2011) Improving aggregate recommendation diversity using ranking-based techniques. IEEE Transactions on Knowledge and Data Engineering 24 (5), pp. 896–911. Cited by: Table 2, §3.1.1.
  • [3] Q. Ai, K. Bi, C. Luo, J. Guo, and W. B. Croft (2018) Unbiased learning to rank with unbiased propensity estimation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 385–394. Cited by: §4.
  • [4] J. Bai, C. Zhou, J. Song, X. Qu, W. An, Z. Li, and J. Gao (2019) Personalized bundle list recommendation. In The World Wide Web Conference, pp. 60–71. Cited by: §4.
  • [5] Y. Bengio and J. Senécal (2008) Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks 19 (4), pp. 713–722. Cited by: §1, §2.2, §3.1.1, §4.
  • [6] A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, and C. Goodrow (2019) Fairness in recommendation ranking through pairwise comparisons. In KDD, External Links: Link Cited by: §4.
  • [7] R. Burke (2017) Multisided fairness for recommendation. External Links: 1707.00093 Cited by: §4, Positive Impact.
  • [8] H. Caselles-Dupré, F. Lesaint, and J. Royo-Letelier (2018) Word2vec applied to recommendation: hyperparameters matter. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 352–356. Cited by: §2.2, §4.
  • [9] M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi (2019) Top-k off-policy correction for a reinforce recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 456–464. Cited by: §4, §4.
  • [10] T. Chen, Y. Sun, Y. Shi, and L. Hong (2017) On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 767–776. Cited by: §2.3.
  • [11] P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198. Cited by: Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems, §1, §1, §2.2, §4.
  • [12] G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems. Cited by: §2.2.
  • [13] M. Grbovic and H. Cheng (2018) Real-time personalization using embeddings for search ranking at airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 311–320. Cited by: §1.
  • [14] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §1, §2.2.
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1, §2.2, §2.3, §3.2, §4.
  • [16] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2015) Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939. Cited by: §2.3.
  • [17] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §4.
  • [18] K. Hornik (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks. Cited by: §2.2.
  • [19] G. W. Imbens and D. B. Rubin (2015) Causal inference in statistics, social, and biomedical sciences. Cambridge University Press. Cited by: §1, §2.2, §4.
  • [20] S. Jean, K. Cho, R. Memisevic, and Y. Bengio (2014) On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007. Cited by: §1, §2.2, §3.1.1, §4.
  • [21] M. R. Joglekar, C. Li, J. K. Adams, P. Khaitan, and Q. V. Le (2019) Neural input search for large scale recommendation models. arXiv preprint arXiv:1907.04471. Cited by: §4.
  • [22] J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §1, §2.1.
  • [23] W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206. Cited by: §3.3.
  • [24] T. Kipf, E. van der Pol, and M. Welling (2019) Contrastive learning of structured world models. arXiv preprint arXiv:1911.12247. Cited by: §4.
  • [25] C. Li, Z. Liu, M. Wu, Y. Xu, H. Zhao, P. Huang, G. Kang, Q. Chen, W. Li, and D. L. Lee (2019) Multi-interest network with dynamic routing for recommendation at tmall. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2615–2623. Cited by: §1, §3.1.2, §3.1.2, §4.
  • [26] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara (2018) Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pp. 689–698. External Links: ISBN 978-1-4503-5639-8, Link, Document Cited by: §2.2.
  • [27] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: Table 7.
  • [28] J. Ma, C. Zhou, P. Cui, H. Yang, and W. Zhu (2019) Learning disentangled representations for recommendation. In Advances in Neural Information Processing Systems, pp. 5712–5723. Cited by: §4.
  • [29] A. McCallum, K. Nigam, et al. (1998) A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, pp. 41–48. Cited by: §2.2.
  • [30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1, §2.2, §3.1.1, §4.
  • [31] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §2.2, §4.
  • [32] P. R. Rosenbaum and D. B. Rubin (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55. Cited by: §1, §2.2, §4.
  • [33] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims (2016) Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352. Cited by: §2.1, §2.3, §4, footnote 1.
  • [34] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §2.2, §2.3.
  • [35] H. Steck (2013) Evaluation of recommendations: rating-prediction and ranking. In Proceedings of the 7th ACM conference on Recommender systems, pp. 213–220. Cited by: §4.
  • [36] S. Stergiou, Z. Straznickas, R. Wu, and K. Tsioutsiouliklis (2017) Distributed negative sampling for word embeddings. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.3.
  • [37] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1441–1450. Cited by: §3.3.
  • [38] S. K. Thompson (2012) Sampling. Wiley. Cited by: §1, §2.2, §4.
  • [39] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2018) Deep graph infomax. arXiv preprint arXiv:1809.10341. Cited by: §4.
  • [40] X. Wang, M. Bendersky, D. Metzler, and M. Najork (2016) Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 115–124. Cited by: §4, footnote 1.
  • [41] L. Yang, Y. Cui, Y. Xuan, C. Wang, S. Belongie, and D. Estrin (2018) Unbiased offline recommender evaluation for missing-not-at-random implicit feedback. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 279–287. Cited by: §4.
  • [42] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. arXiv preprint arXiv:1806.01973. Cited by: §3.1.1, §4.
  • [43] C. Zhou, Y. Liu, X. Liu, Z. Liu, and J. Gao (2017) Scalable graph embedding for asymmetric proximity. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §4.
  • [44] H. Zhu, X. Li, P. Zhang, G. Li, J. He, H. Li, and K. Gai (2018) Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1079–1088. Cited by: §1, §4.
  • [45] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou (2019) AliGraph: a comprehensive graph neural network platform. Proceedings of the VLDB Endowment 12 (12), pp. 2094–2105. Cited by: §3.2.