A Causal Perspective to Unbiased Conversion Rate Estimation on Data Missing Not at Random

10/16/2019 ∙ by Wenhao Zhang, et al. ∙ 0

In modern e-commerce and advertising recommender systems, ongoing research attempts to optimize conversion rate (CVR) estimation and increase the gross merchandise volume. Even though state-of-the-art CVR estimators adopt deep learning methods, their performance is still subject to sample selection bias and data sparsity issues. Conversion labels of exposed items in the training dataset are typically missing not at random (MNAR) due to selection bias. Empirically, data sparsity degrades the performance of models with a large parameter space. In this paper, we propose two causal estimators combined with multi-task learning, and aim to solve the sample selection bias (SSB) and data sparsity (DS) issues in conversion rate estimation. The proposed estimators adjust for the MNAR mechanism as if they were trained on a "do dataset" where users are forced to click on all exposed items. We evaluate the causal estimators with billions of data samples. Experiment results demonstrate that the proposed CVR estimators outperform other state-of-the-art CVR estimators. In addition, an empirical study shows that our methods are cost-effective on large-scale datasets.




1 Introduction

Selection bias is a widely recognized issue in recommender systems [schnabel_recommendations_2016, ma_entire_2018, de2014reducing]. For example, music streaming services usually suggest genres that have positive user feedback (e.g., favorite, share, and buy), and selectively ignore the ones that are rarely exposed to users [van2013deep]. In this paper, we study the selection bias that exists in post-click conversion rate (CVR) estimation.

Figure 1: Illustration of the selection bias issue in conventional conversion rate (CVR) estimation. A typical e-commerce transaction has the following sequential path: "exposure → click (user self-selection) → conversion". The training space of conventional CVR models is the click space, whereas the inference space is the entire exposure space. The discrepancy of data distribution between training space and inference space leads to selection bias in conventional CVR models.

Problem formulation

Post-click conversion rate (CVR) estimation is an essential task in e-commerce recommender systems. Generally, conversion rates are used to compute item ranking scores, and we prioritize and recommend items with high ranking scores to our customers. A typical e-commerce transaction has the following sequential path: "exposure → click → conversion". Post-click conversion rate indicates the probability of the transition from click to conversion. Typically, when training CVR models, we only consider the items that customers clicked on, as we are unaware of the conversion feedback of the items that are not clicked by customers [huang2007correcting]. Therefore, the unclicked data are excluded from our training dataset. Among the clicked items, the ones that customers purchase are positive samples, whereas the others are negative samples. Bear in mind that not clicking on an item does not necessarily indicate that the customer is not interested in purchasing it. Customers may unconsciously skip certain items that might be interesting to them. Figure 1 reveals that the exposure space is a superset of the click space. Selection bias comes from the fact that conventional CVR models are trained in the click space, whereas the predictions are made within the entire exposure space (Figure 1) [ma_entire_2018]. Intuitively, data in the click space is drawn from the entire exposure space and is biased by user self-selection (e.g., the interests of users). Therefore, the data distribution in the click space is systematically different from the one in the exposure space. This inherent discrepancy leads to data that is Missing Not At Random (MNAR), and to selection bias in conventional CVR models [de2014reducing, little2019statistical, little2002statistical].

We identify two practical issues that make CVR estimation quite challenging in industrial-level recommender systems:

  • Selection bias: The systematic difference of data distributions between training space (i.e., all user self-selected items) and inference space (i.e., all exposed items) biases conventional CVR models [steck2010training, huang2007correcting, ai2018unbiased]. This bias usually causes the degradation of model performance.

  • Data sparsity: In the CVR estimation task, it refers to the fact that item clicks are rare events (we have a CTR of 5.2% in the production dataset and 4% in the public dataset). Conventional CVR models are typically trained only using data in the click space. Therefore, the number of training samples may not be sufficient for the large parameter space. In our experiments, the numbers are 0.6 billion samples vs. 5.3 billion parameters in the production dataset, and 3.4 million samples vs. 2.6 billion parameters in the public dataset (see Section 5.1) [lee2012estimating, wang2018billion].

To address the critical issues of selection bias and data sparsity in the CVR estimation, we approach the problem from a causal perspective, and develop causal methods in a multi-task learning framework. We propose two principled, efficient and highly effective CVR estimators, namely, Multi-task Inverse Propensity Weighting estimator (Multi-IPW) and Multi-task Doubly Robust estimator (Multi-DR). Our methods are designed for an unbiased CVR estimation, and also account for the data sparsity issue.

The main contributions of this paper are summarized as follows:

  • To the best of our knowledge, this is the first paper that aims to solve the selection bias issue in conversion rate estimation from a causal perspective. Different from existing works, our methods adjust for the MNAR mechanism of the data, and deal with the selection bias of CVR estimation in a principled way. Meanwhile, we give mathematical proofs that the proposed methods are theoretically unbiased.

  • To the best of our knowledge, this is the first paper that combines causal inference methods with multi-task learning. We argue that such combination could make proposed estimators both effective and efficient in industrial setting. Specifically, by introducing multi-task learning, our methods could effectively address the data sparsity issue, run times faster and save half of the memory, compared with other state-of-the-art causal approaches that we modify for the CVR estimation task.

  • We propose two principled, efficient and highly effective models, namely, the Multi-task Inverse Propensity Weighting estimator (Multi-IPW) and the Multi-task Doubly Robust estimator (Multi-DR). Multi-IPW integrates the inverse propensity-score weighting technique with multi-task learning, which aims to solve both the selection bias and data sparsity issues. Multi-DR augments Multi-IPW with a doubly robust mechanism. It enjoys the merits of Multi-IPW, together with an expanded training set.

  • We conducted extensive experiments on an industrial-level production dataset, i.e., the Taobao dataset (11.5 billion samples), and a public dataset, i.e., the Ali-CCP dataset (84 million samples), to compare Multi-IPW and Multi-DR against several state-of-the-art CVR prediction models. Experiment results show that our models outperform ESMM [ma_entire_2018] and state-of-the-art causal models, and demonstrate the efficiency of our methods in a real industrial setting.

2 Related works

In this section, we review several existing works that attempt to tackle the selection bias issue in recommender systems. Meanwhile, we summarize how our methods are different from prior works.

Ma et al. [ma_entire_2018] proposed the Entire Space Multi-task Model (ESMM) to remedy the selection bias and data sparsity issues in conversion rate (CVR) estimation. ESMM is trained in the entire exposure space, and it reduces the CVR task to two auxiliary tasks, i.e., click-through rate (CTR) and click-through & conversion rate (CTCVR) estimation. However, we argue that ESMM is not an unbiased estimator, hence it can only address the selection bias issue to some extent. The details of our argument are presented in Section 3.2.


Causal inference provides us with a way to account for the data generation process when we attempt to restore the information of the data that is MNAR [little2002statistical]. Schnabel et al. [schnabel_recommendations_2016] proposed a propensity score weighting based estimator in an empirical risk minimization framework for learning and evaluating recommender systems from biased data. Propensity-based models may still be biased if the propensities are not accurately estimated (the mathematical proof can be found in Section 3.4). Wang et al. [wang_doubly_2019] proposed a doubly robust joint learning approach for estimating item ratings that are MNAR. The Doubly Robust (DR) estimator combines propensity-based methods with an imputation model that estimates the prediction error for the missing data. When the propensities are not accurately learned, the DR estimator can still enjoy unbiasedness as long as its imputation model is accurate. However, DR is not devised for CVR estimation, hence it fails to account for the severe data sparsity issue that widely exists in CVR estimation tasks. In addition, the joint learning approach tends to scale poorly to large training datasets. We will further discuss the joint learning DR in Section 3.5.


To summarize, our approach differs from the aforementioned methods in three aspects: 1) The problems are different. We developed our methods for CVR estimation in e-commerce systems, while they focus on rating prediction [lu2018coevolutionary]. 2) The challenges are different. We design our models to address both the selection bias and data sparsity issues, while they only consider the former (ESMM considers both). 3) The methods are different. We integrate a multi-task framework with causal approaches. Specifically, we co-train the propensity model, imputation model and prediction model simultaneously with deep neural networks, while they train these modules separately or alternately, and usually with models such as linear regression or matrix factorization [gopalan2014content, guo2017deepfm, rendle2010factorization, he2016fast]. We will further justify our design in Section 4 and report the performance improvement in Section 5.

3 Unbiased estimators for CVR prediction

Figure 2: A toy example that demonstrates ESMM is not an unbiased CVR estimator.

In this section, we first present the mathematical formulation of estimation bias as well as the notations used in this paper. Then we demonstrate why ESMM and its variants are incapable of eliminating selection bias in CVR estimation. Next, we formulate the CVR estimation task as a causal inference problem, and discuss how causal intervention can remove the selection bias. Finally, we present two existing unbiased estimators in causal inference, i.e., the Inverse Propensity Weighting (IPW) estimator and the Doubly Robust (DR) estimator, and pinpoint the drawbacks of these models.

3.1 Preliminary

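Let $\mathcal{U} = \{u_1, \dots, u_N\}$ be a set of users and $\mathcal{I} = \{i_1, \dots, i_M\}$ be a set of items. $\mathcal{D} = \mathcal{U} \times \mathcal{I}$ denotes the user-item pairs. Let $\mathbf{R} \in \{0,1\}^{N \times M}$ be the true conversion label matrix where each entry $r_{u,i} \in \{0,1\}$. A CVR estimator predicts the probability of conversion for each user-item pair in $\mathcal{D}$. Let $\hat{\mathbf{R}} \in [0,1]^{N \times M}$ be the predicted conversion score matrix where each entry $\hat{r}_{u,i} \in [0,1]$. Therefore, the prediction inaccuracy, $\mathcal{P}$, over all user-item pairs can be formulated as follows,

$$\mathcal{P} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} e(r_{u,i}, \hat{r}_{u,i}), \tag{1}$$

where $e(r_{u,i}, \hat{r}_{u,i})$ is the cross-entropy loss between $r_{u,i}$ and $\hat{r}_{u,i}$. Let $\mathbf{O} \in \{0,1\}^{N \times M}$ be the indicator matrix where each entry $o_{u,i}$ is an observation indicator: $o_{u,i} = 1$ if user $u$ clicks on item $i$; otherwise, $o_{u,i} = 0$. Naive CVR prediction models are trained only using observed data $\mathcal{O} = \{(u,i) \mid (u,i) \in \mathcal{D},\ o_{u,i} = 1\}$. Let $\mathbf{R}^{obs}$ and $\mathbf{R}^{mis}$ be the sets of conversion labels that are present and absent in our data. We evaluate these naive CVR models by averaging the cross-entropy loss over the observed data [schnabel_recommendations_2016, wang_doubly_2019],

$$\mathcal{E}^{N} = \frac{1}{|\mathcal{O}|} \sum_{(u,i) \in \mathcal{O}} e(r_{u,i}, \hat{r}_{u,i}), \tag{2}$$

where $|\mathcal{O}| = \sum_{(u,i) \in \mathcal{D}} o_{u,i}$. Let $P(\mathbf{O})$ be the probability distribution that governs the missingness variable $\mathbf{O}$. The missingness of conversion feedback depends on the CTR. Therefore, this user self-selection process generates data that is MNAR [rubin1976inference, enders2010applied].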

We say a CVR estimator, $\mathcal{M}$, is unbiased when the expectation of its estimated prediction inaccuracy $\mathcal{E}^{\mathcal{M}}$ over the training dataset equals the prediction inaccuracy $\mathcal{P}$, i.e., $\left|\mathbb{E}_{\mathbf{O}}[\mathcal{E}^{\mathcal{M}}] - \mathcal{P}\right| = 0$; otherwise it is biased. If data is missing completely at random (MCAR), then the naive estimation is unbiased. In other words, the observed data can be deemed a sample drawn randomly from the population, and the expectation of the estimated prediction inaccuracy in these samples equals the value in the population [kallenberg2006foundations, liang2016modeling]. However, if data is MNAR, the naive estimate is generally biased, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}^{N}] \neq \mathcal{P}$.
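The contrast between MCAR and MNAR data can be checked numerically. The following is a minimal sketch on synthetic data, not the paper's experiment: the `interest` variable, the click and conversion probabilities, and the constant predictor are all assumptions chosen for illustration. It compares the loss averaged over clicked pairs with the loss averaged over every exposed pair.

```python
import numpy as np

def cross_entropy(r, r_hat, eps=1e-12):
    """Element-wise cross-entropy between labels r and predicted scores r_hat."""
    r_hat = np.clip(r_hat, eps, 1.0 - eps)
    return -(r * np.log(r_hat) + (1 - r) * np.log(1 - r_hat))

def prediction_inaccuracy(r, r_hat):
    """Average loss over every exposed user-item pair."""
    return cross_entropy(r, r_hat).mean()

def naive_inaccuracy(r, r_hat, o):
    """Average loss over clicked pairs only (o == 1)."""
    clicked = o.astype(bool)
    return cross_entropy(r[clicked], r_hat[clicked]).mean()

rng = np.random.default_rng(0)
n = 200_000
interest = rng.uniform(size=n)                # hypothetical self-selection factor
r = rng.binomial(1, 0.05 + 0.40 * interest)   # conversions if every pair were clicked
r_hat = np.full(n, r.mean())                  # deliberately simple constant predictor

click_prob = 0.02 + 0.30 * interest           # clicks depend on interest -> MNAR
o_mnar = rng.binomial(1, click_prob)
o_mcar = rng.binomial(1, click_prob.mean(), size=n)  # same click rate, no dependence

p_true = prediction_inaccuracy(r, r_hat)
gap_mnar = naive_inaccuracy(r, r_hat, o_mnar) - p_true
gap_mcar = naive_inaccuracy(r, r_hat, o_mcar) - p_true
print(f"MNAR gap: {gap_mnar:+.4f}, MCAR gap: {gap_mcar:+.4f}")
```

Under these assumptions the MCAR gap stays close to zero, while the MNAR gap does not shrink as the sample grows, because clicked pairs over-represent high-interest users.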

3.2 Is ESMM an unbiased CVR estimator?

In this section, we demonstrate that ESMM, the state-of-the-art CVR estimator in practice, is essentially biased, although its authors claim that the model eliminates the selection bias [ma_entire_2018]. We first formulate the estimation bias of ESMM and show that it is not theoretically unbiased by giving a counterexample. Then we intuitively pinpoint the weakness of ESMM, and provide a more in-depth explanation from a causal perspective in the next section.

Let , be the cross-entropy losses of CTR, CVR, and CTCVR tasks. Then we have,


We can easily verify that using the counter example in Figure 2. Note that to be theoretically unbiased, ESMM should satisfy . Therefore, we conclude that ESMM cannot ensure unbiased CVR estimation, and it is a biased CVR estimator.

ESMM proposes a workaround to estimate the CVR prediction inaccuracy in the entire exposure space, in contrast to the conventional estimators that work in the click space. However, the true conversion feedback in the exposure space is still missing not at random. Even if ESMM can accurately predict CTR and CTCVR in the exposure space, the unbiasedness of ESMM still depends on whether the true CVR can be predicted accurately for the unclicked data. Since ESMM is not trained with the correct conversion labels of these unclicked data, it is unlikely to predict their CVR accurately, hence it tends to be biased. In the next section, we will provide a more in-depth discussion of this weakness from a causal perspective.

3.3 A causal perspective to unbiased CVR estimation

Recall that selection bias in CVR estimation comes from the fact that models are trained over the click space, whereas the predictions are made over the exposure space (see Figure 1). Ideally, we can remove the selection bias by building our CVR estimators using a dataset where the conversion labels of all the items are known. In the language of causal inference, it is equivalent to training CVR estimators on a "do dataset", where causal intervention is applied on the click event during the data generation process. Specifically, users are "forced" to click on every item in the exposure space and further make their purchase decisions. Note that the training space is the same as the inference space in the "do dataset". Hence, the selection bias is eliminated. Figure 3 gives an intuitive view of how causal intervention removes the bias. Self-selection factors affect both click events and conversion events. For example, such a factor can be the purchase interest or the price discount that customers consider in online shopping. In causal inference, we refer to these factors as "confounder(s)" that bias the CVR inference [wang2018deconfounded]. Once the causal intervention is applied on the click event (i.e., users are forced to click on all exposed items), the confounder has no control over user click behaviors. It means that we have successfully removed the confounder which biases our CVR estimation [pearl2009causal, pearl2018book, wang2018deconfounded, pearl2000causality, morgan2015counterfactuals].

Figure 3: This causal graph formulates CVR estimation as a causal problem [pearl2018book]. In (a), a confounder affects both clicks and purchases, and it biases the inference. In (b), we apply an intervention on click events (do(Click)=1). Once users are "forced" to click on each exposed item, the confounder has no control over user click behaviors. Note the absence of the arrow from the confounder to Click. Hence, we have successfully removed the confounder and the selection bias [pearl2009causal, pearl2018book, wang2018deconfounded, pearl2000causality, morgan2015counterfactuals].

Clearly, the "do dataset" generated in this imaginary intervention experiment cannot be obtained in reality. The challenge is therefore how to train our CVR estimators on the observed dataset as if they were trained on the "do dataset". In the following sections, we will discuss two estimators that can achieve unbiased CVR prediction with data that are MNAR.

As for ESMM and its variants, they are directly trained over the entire exposure space, which is different from the "do dataset". In the "do dataset", we apply causal intervention on the click events during the data generation process to obtain the true conversion label matrix (see matrix e in Figure 2). Without such intervention, the conversion labels in the dataset are still missing not at random. For example, in Figure 2, matrix f is obtained without causal intervention. The conversion label of the last entry is observed as 0 because its click label is 0, whereas the true conversion label is 1 in matrix e. We can also understand the weakness of ESMM and its variants using the causal graph (see Figure 3): CVR estimators cannot achieve unbiased CVR estimation as long as they fail to remove the confounder.
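The effect of the confounder can be illustrated with a small simulation. This is a hedged sketch on synthetic data: the binary confounder `z` and all probabilities below are assumptions, not values from the paper. It compares the conversion rate measured in the click space with the conversion rate in a simulated "do dataset" where every exposed item is clicked.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Hypothetical confounder: 1 = strong purchase interest, 0 = weak interest.
z = rng.binomial(1, 0.3, size=n)
click_prob = np.where(z == 1, 0.30, 0.03)   # interest raises the chance of a click
conv_prob = np.where(z == 1, 0.20, 0.02)    # and the chance of converting once clicked

# Potential conversion outcome of every exposed pair under do(Click = 1).
conv_if_clicked = rng.binomial(1, conv_prob)

# Observational data: conversions are only visible where the user chose to click.
click = rng.binomial(1, click_prob)
cvr_click_space = conv_if_clicked[click == 1].mean()

# "Do dataset": the intervention cuts the link from z to Click, so all pairs count.
cvr_do = conv_if_clicked.mean()

print(f"CVR in click space: {cvr_click_space:.3f}, CVR under do(Click=1): {cvr_do:.3f}")
```

Because clicked pairs are dominated by high-interest users, the click-space conversion rate overstates the conversion rate that would be observed over the whole exposure space.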

3.4 Inverse propensity weighting CVR estimator

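Recall that the missingness in observed data is governed by a probability distribution, $P(\mathbf{O})$. Equivalently, the probability distribution determines the data generation process. The marginal probability $P(o_{u,i} = 1)$ is referred to as the propensity score, $p_{u,i}$, of observing an entry in $\mathbf{R}$. In practice, the real $p_{u,i}$ cannot be obtained directly. Instead, we estimate the real propensity with the learned propensity, $\hat{p}_{u,i}$. The inverse propensity weighting (IPW) CVR estimator uses $\hat{p}_{u,i}$ to inversely weight the prediction loss [little2002statistical, imbens2015causal, schnabel_recommendations_2016, hirano2003efficient],

$$\mathcal{E}^{IPW} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}\, e(r_{u,i}, \hat{r}_{u,i})}{\hat{p}_{u,i}}. \tag{4}$$

Since the IPW CVR estimator adjusts for the data generation process, its CVR estimation is theoretically unbiased if the propensity estimation is accurate (i.e., $\hat{p}_{u,i} = p_{u,i}$),

$$\mathbb{E}_{\mathbf{O}}\left[\mathcal{E}^{IPW}\right] = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{\mathbb{E}[o_{u,i}]\, e(r_{u,i}, \hat{r}_{u,i})}{\hat{p}_{u,i}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i}}{\hat{p}_{u,i}}\, e(r_{u,i}, \hat{r}_{u,i}) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} e(r_{u,i}, \hat{r}_{u,i}).$$
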
Typically, the learned propensity score is estimated with an independent logistic regression model [austin2011introduction]. In Section 4, we propose the Multi-IPW model, which leverages the multi-task framework to learn the propensity score (i.e., CTR in Multi-IPW) simultaneously with CVR. The naive IPW estimator (Equation (4)) might suffer from high variance [gilotte2018offline, schnabel_recommendations_2016, wang_doubly_2019]. Swaminathan et al. proposed a self-normalized inverse propensity scoring (SNIPS) estimator to reduce the variability [swaminathan2015self]. The Multi-IPW model also incorporates this technique.
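The IPW and SNIPS estimators described in this section can be sketched in a few lines. The example below is illustrative only: the synthetic `interest` variable, the probabilities, and the constant prediction of 0.25 are assumptions. It evaluates both estimators with the true propensities and then the IPW estimator with a deliberately wrong, constant propensity.

```python
import numpy as np

def cross_entropy(r, r_hat, eps=1e-12):
    """Element-wise cross-entropy between labels r and predicted scores r_hat."""
    r_hat = np.clip(r_hat, eps, 1.0 - eps)
    return -(r * np.log(r_hat) + (1 - r) * np.log(1 - r_hat))

def ipw_inaccuracy(e, o, p_hat):
    """Clicked losses weighted by 1 / p_hat, averaged over all exposed pairs."""
    return np.sum(o * e / p_hat) / e.size

def snips_inaccuracy(e, o, p_hat):
    """Self-normalized variant: divide by the sum of weights instead of |D|."""
    w = o / p_hat
    return np.sum(w * e) / np.sum(w)

rng = np.random.default_rng(0)
n = 200_000
interest = rng.uniform(size=n)                # hypothetical self-selection factor
p = 0.02 + 0.30 * interest                    # true propensity (click probability)
o = rng.binomial(1, p)                        # observed clicks
r = rng.binomial(1, 0.05 + 0.40 * interest)   # conversions if every pair were clicked
e = cross_entropy(r, np.full(n, 0.25))        # per-pair loss of a constant predictor

# The mask o ensures that only losses of clicked pairs enter the estimates.
target = e.mean()                             # loss over the whole exposure space
ipw_true = ipw_inaccuracy(e, o, p)
snips_true = snips_inaccuracy(e, o, p)
ipw_flat = ipw_inaccuracy(e, o, np.full(n, p.mean()))  # inaccurate propensities

print(f"target={target:.4f} ipw={ipw_true:.4f} "
      f"snips={snips_true:.4f} ipw_flat={ipw_flat:.4f}")
```

With accurate propensities both estimates land near the exposure-space loss, whereas the constant propensity reproduces the bias of the naive click-space average.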

3.5 Doubly robust CVR estimator

The inverse propensity weighting model is unbiased only under the assumption that we can accurately estimate the real propensity (i.e., $\hat{p}_{u,i} = p_{u,i}$). In practice, this assumption is too restrictive. To relax this constraint, the doubly robust estimator was introduced in previous works [dudik2011doubly, wang_doubly_2019, vermeulen2015bias, farajtabar2018more].

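Wang et al. [wang_doubly_2019] proposed a joint learning approach for training a doubly robust estimator. The joint learning approach alternately trains two models: 1) a prediction model $\hat{r}_{u,i} = f_{\theta}(x_{u,i})$; and 2) an imputation model $\hat{e}_{u,i} = g_{\phi}(x_{u,i})$. The prediction model, parameterized by $\theta$, is designed to predict the conversion rate, and its performance is evaluated by $e_{u,i} = e(r_{u,i}, \hat{r}_{u,i})$. The imputation model, parameterized by $\phi$, aims to estimate the prediction error $e_{u,i}$ with $\hat{e}_{u,i}$. Its performance is assessed by $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$. The feature vector $x_{u,i}$ encodes all the information about the user $u$ and the item $i$. Then, we can formulate the loss of the doubly robust CVR estimator as,

$$\mathcal{E}^{DR} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left( \hat{e}_{u,i} + \frac{o_{u,i}\, \delta_{u,i}}{\hat{p}_{u,i}} \right).$$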

Figure 4: Multi-Inverse Propensity Weighting (Multi-IPW) estimator, and Multi-Doubly Robust (Multi-DR) estimator. The Multi-DR estimator augments Multi-IPW with an imputation model. We use predicted CTR as propensity scores in the Multi-IPW estimator. In the multi-task learning module, CTR task, CVR task, and Imputation task are chained together with shared embedding space. Multi-IPW only incorporates CTR task and CVR task, whereas Multi-DR includes Imputation task as well.

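Recall that the bias of an estimator is quantified by the absolute difference between the expectation of its estimated prediction inaccuracy and the prediction inaccuracy $\mathcal{P}$ over all user-item pairs. Hence, we can mathematically derive the bias of the doubly robust CVR estimator,

$$\text{Bias}\left(\mathcal{E}^{DR}\right) = \left| \mathbb{E}_{\mathbf{O}}\left[\mathcal{E}^{DR}\right] - \mathcal{P} \right| = \frac{1}{|\mathcal{D}|} \left| \sum_{(u,i) \in \mathcal{D}} \Delta_{u,i}\, \delta_{u,i} \right|,$$

where $\Delta_{u,i} = \frac{\hat{p}_{u,i} - p_{u,i}}{\hat{p}_{u,i}}$, and $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$. Note that the doubly robust estimator is unbiased if either the true propensity scores or the true prediction errors are accurately estimated. This is a clear advantage over the naive IPW model, whose estimation is unbiased only if we can accurately learn the true propensities.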
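The double robustness property can be checked numerically. The sketch below uses synthetic per-pair losses, and every number in it is an assumption for illustration. The function `dr_inaccuracy` computes the doubly robust estimate as the imputed error plus an inverse-propensity-weighted correction on clicked pairs, and `dr_bias` evaluates the closed-form bias, the absolute sum of products of propensity error and imputation error divided by the number of pairs.

```python
import numpy as np

def dr_inaccuracy(e, e_hat, o, p_hat):
    """Imputed error on every pair plus an IPW correction on clicked pairs."""
    delta = e - e_hat
    return np.mean(e_hat + o * delta / p_hat)

def dr_bias(e, e_hat, p, p_hat):
    """Closed-form bias with Delta = (p_hat - p) / p_hat and delta = e - e_hat."""
    return np.abs(np.sum((p_hat - p) / p_hat * (e - e_hat))) / e.size

rng = np.random.default_rng(0)
n = 200_000
interest = rng.uniform(size=n)     # hypothetical self-selection factor
p = 0.02 + 0.30 * interest         # true propensities
o = rng.binomial(1, p)             # observed clicks
e = 0.2 + 0.8 * interest           # synthetic per-pair prediction errors
target = e.mean()                  # prediction inaccuracy over all pairs

p_wrong = np.full(n, p.mean())     # inaccurate propensity model
e_wrong = np.full(n, 0.3)          # inaccurate imputation model

dr_imputation_ok = dr_inaccuracy(e, e.copy(), o, p_wrong)  # only imputation accurate
dr_propensity_ok = dr_inaccuracy(e, e_wrong, o, p)         # only propensity accurate
dr_both_wrong = dr_inaccuracy(e, e_wrong, o, p_wrong)      # neither accurate

print(f"target={target:.4f} imputation_ok={dr_imputation_ok:.4f} "
      f"propensity_ok={dr_propensity_ok:.4f} both_wrong={dr_both_wrong:.4f}")
print(f"closed-form bias when both are wrong: {dr_bias(e, e_wrong, p, p_wrong):.4f}")
```

In this setup the estimate stays near the target whenever one of the two models is accurate, and the closed-form bias agrees with the observed gap when both are wrong, up to sampling noise.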

In real-world practice, we argue that the joint learning approach may demand a large amount of computational and storage resources. On one hand, we need to train a separate propensity model, which takes almost the same time and memory as the CVR task. On the other hand, we need to alternate between training the prediction model and the imputation model. Such an alternating process may greatly increase the training time. As the size of the dataset grows, the training time can quickly become unmanageable in industrial production.

In the next section, we will discuss the Multi-DR model, which aims to efficiently solve the selection bias and data sparsity issues in the CVR estimation task under an industrial setting.

4 Causal CVR Estimators with multi-task learning

We present the architecture of the Multi-Inverse Propensity Weighting (Multi-IPW) estimator, and the Multi-Doubly Robust (Multi-DR) estimator in Figure 4. Both proposed estimators are developed based on a multi-task learning module, which mitigates the data sparsity issue, and boosts the efficiency during training at industrial-level.

4.1 Multi-task learning module

As the building block of our causal CVR estimators, the multi-task learning module exploits the typical sequence of events on an e-commerce site, i.e., "exposure → click → conversion", and chains the main CVR task with an auxiliary CTR task. In industrial-level CVR modeling, the data sparsity issue refers to the fact that the size of the CVR training dataset is limited by the rare click events, so the large number of model parameters may not be well trained. To address this issue, we adopt the philosophy of multi-task learning and introduce an auxiliary CTR task [ni2018perceive, liang2016modeling, gao2019neural, ma_entire_2018, hadash2018rank, zhou2018deep, wang2019multi]. The amount of training data in the CTR task is generally larger than that in the CVR task by orders of magnitude, thus the CTR task trains the large number of model parameters more sufficiently. The feature representation learned in the CTR task is shared with the CVR task, hence the CVR model can benefit from the extra information through parameter sharing. Therefore, the data sparsity issue is remedied by this multi-task learning mechanism. Meanwhile, multi-task learning is also well known for being cost-effective in model development. Typically, multi-task learning co-trains multiple tasks simultaneously as if they were one task. This mechanism can potentially reduce the storage space needed for duplicate copies of the embedding representation. In addition, the parallel training mechanism generally reduces the training time substantially. The proposed Multi-IPW and Multi-DR inherit the aforementioned merits by incorporating the multi-task learning module in causal approaches.

4.2 Multi-inverse propensity weighting estimator

Figure 4 shows that the Multi-IPW model is built upon the CVR task and the CTR task. Conventional IPW estimators normally require an independent model (e.g., logistic regression) to predict the true propensity scores [schnabel_recommendations_2016]. However, the Multi-IPW estimator can exploit the multi-task learning architecture, and use the predicted CTR scores generated by the CTR task as propensity scores. Since the CTR task and CVR task share embedding parameters, we can expect better CVR prediction performance due to the extra information provided by the CTR task. In addition, multi-task learning reduces the training time of Multi-IPW by half. These are clear advantages over conventional IPW estimators.

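We train Multi-IPW by minimizing the following training loss,

$$\mathcal{E}^{\text{Multi-IPW}}(\mathcal{X}; \theta_{CTR}, \theta_{CVR}, \Phi) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}\, e\big(r_{u,i}, \hat{r}_{u,i}(x_{u,i}; \theta_{CVR}, \Phi)\big)}{\hat{o}_{u,i}(x_{u,i}; \theta_{CTR}, \Phi)},$$

where $\Phi$ represents the shared embedding parameters. $\theta_{CVR}$ and $\theta_{CTR}$ are the neural network parameters of the CVR task and CTR task, respectively. $e(r_{u,i}, \hat{r}_{u,i})$, parameterized by $\theta_{CVR}$ and $\Phi$, is the cross-entropy loss of the true CVR label $r_{u,i}$ and the predicted CVR score $\hat{r}_{u,i}$. We use the predicted CTR score $\hat{o}_{u,i}$, parameterized by $\theta_{CTR}$ and $\Phi$, as propensities. $\mathcal{D}$ denotes all data in the exposure space.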

Theorem 1

Given the true propensities $p_{u,i}$ and the true conversion labels $r_{u,i}$, the Multi-IPW CVR estimator gives unbiased CVR prediction when the estimated CTR scores are accurate, i.e., $\hat{o}_{u,i} = p_{u,i}$.

Input: Observed conversion labels $\mathbf{R}^{obs}$ and user-item features $\mathcal{X}$
while stopping criteria is not satisfied do
      Sample a batch of user-item pairs from $\mathcal{D}$
      Co-train CTR task and CVR task:
            Update $\theta_{CTR}, \Phi$ by descending along the gradients of the CTR loss
            Update $\theta_{CVR}, \Phi$ by descending along the gradients of the Multi-IPW training loss
end while
Algorithm 1 Multi-Inverse Propensity Weighting
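One co-training step of Algorithm 1 can be sketched as a forward pass that produces both task losses from a shared embedding. This is a simplified illustration rather than the authors' implementation: the single-layer towers, the embedding size, the propensity clipping threshold `min_propensity`, and the use of a plain cross-entropy loss for the CTR task are all assumptions, and the self-normalization step is omitted. Gradient updates would be delegated to an automatic differentiation framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-7):
    """Element-wise binary cross-entropy."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def init_params(rng, n_users, n_items, dim=8):
    """Shared embeddings plus one small tower per task (hypothetical sizes)."""
    return {
        "user_emb": rng.normal(0.0, 0.1, (n_users, dim)),
        "item_emb": rng.normal(0.0, 0.1, (n_items, dim)),
        "w_ctr": rng.normal(0.0, 0.1, 2 * dim), "b_ctr": 0.0,
        "w_cvr": rng.normal(0.0, 0.1, 2 * dim), "b_cvr": 0.0,
    }

def forward(params, users, items):
    """Both towers read the same embedding lookup, so the CTR task shapes CVR inputs."""
    x = np.concatenate([params["user_emb"][users], params["item_emb"][items]], axis=1)
    p_ctr = sigmoid(x @ params["w_ctr"] + params["b_ctr"])
    p_cvr = sigmoid(x @ params["w_cvr"] + params["b_cvr"])
    return p_ctr, p_cvr

def multi_ipw_losses(p_ctr, p_cvr, click, conversion, min_propensity=1e-3):
    """IPW-weighted CVR loss over the exposure space, plus a CTR loss."""
    propensity = np.clip(p_ctr, min_propensity, 1.0)   # predicted CTR as propensity
    cvr_loss = np.sum(click * bce(conversion, p_cvr) / propensity) / click.size
    ctr_loss = bce(click, p_ctr).mean()
    return cvr_loss, ctr_loss

rng = np.random.default_rng(0)
params = init_params(rng, n_users=100, n_items=50)
users = rng.integers(0, 100, size=256)
items = rng.integers(0, 50, size=256)
click = rng.binomial(1, 0.2, size=256)
conversion = click * rng.binomial(1, 0.3, size=256)    # unclicked pairs carry no label

p_ctr, p_cvr = forward(params, users, items)
cvr_loss, ctr_loss = multi_ipw_losses(p_ctr, p_cvr, click, conversion)
print(f"IPW-weighted CVR loss: {cvr_loss:.4f}, CTR loss: {ctr_loss:.4f}")
```

Because the click indicator masks the CVR term, whatever placeholder value is stored as the conversion label of an unclicked pair has no influence on the CVR loss.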

4.3 Multi-doubly robust estimator

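The Multi-DR estimator augments the Multi-IPW estimator and adds double robustness by including an imputation model, which aims to estimate the prediction error $e_{u,i} = e(r_{u,i}, \hat{r}_{u,i})$, where $e(\cdot, \cdot)$ denotes the loss function of the prediction model. Bear in mind that Multi-DR also inherits the merits of the multi-task learning module described in Section 4.1. Multi-DR optimizes the following loss,

$$\mathcal{E}^{\text{Multi-DR}}(\mathcal{X}; \theta_{CTR}, \theta_{CVR}, \theta_{Imp}, \Phi) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left( \hat{e}_{u,i}(x_{u,i}; \theta_{Imp}, \Phi) + \frac{o_{u,i}\, \delta_{u,i}}{\hat{o}_{u,i}(x_{u,i}; \theta_{CTR}, \Phi)} \right),$$

where $\Phi$ represents the shared embedding parameters among the CTR task, CVR task, and Imputation task. $\theta_{CTR}$, $\theta_{CVR}$, $\theta_{Imp}$ are the neural network parameters of the CTR task, CVR task, and Imputation task, respectively. $\hat{o}_{u,i}$ is the predicted CTR score given by the CTR task. The estimated imputation error $\hat{e}_{u,i}$, parameterized by $\theta_{Imp}$ and $\Phi$, is given by the Imputation task. $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$.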

Theorem 2

Given the true propensities $p_{u,i}$ and the true conversion labels $r_{u,i}$, the Multi-DR CVR estimator gives unbiased CVR prediction when either the estimated CTR scores are accurate, i.e., $\hat{o}_{u,i} = p_{u,i}$, or the estimated prediction errors are accurate, i.e., $\hat{e}_{u,i} = e_{u,i}$.

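Input: Observed conversion labels $\mathbf{R}^{obs}$ and user-item features $\mathcal{X}$
while stopping criteria is not satisfied do
      Sample a batch of user-item pairs from $\mathcal{D}$
      Co-train CTR task, CVR task, and Imputation task:
            Update $\theta_{CTR}, \Phi$ by descending along the gradients of the CTR loss
            Update $\theta_{CVR}, \Phi$ by descending along the gradients of the Multi-DR training loss
            Update $\theta_{Imp}, \Phi$ by descending along the gradients of the imputation loss
end while
Algorithm 2 Multi-Doubly Robust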
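The losses used in one co-training step of Algorithm 2 can be sketched as follows, taking the three task outputs as given arrays. This is a sketch under stated assumptions: the propensity clipping threshold and the squared-deviation form of the imputation loss (weighted by inverse propensity, in the style of the joint learning approach) are not confirmed by the text above.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Element-wise binary cross-entropy."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def multi_dr_losses(p_ctr, p_cvr, e_hat, click, conversion, min_propensity=1e-3):
    """Doubly robust CVR loss and an assumed squared-deviation imputation loss."""
    propensity = np.clip(p_ctr, min_propensity, 1.0)  # predicted CTR as propensity
    e = bce(conversion, p_cvr)                        # only meaningful where click == 1
    delta = click * (e - e_hat)                       # error deviation on clicked pairs
    dr_loss = np.mean(e_hat + delta / propensity)
    imputation_loss = np.sum(delta ** 2 / propensity) / click.size
    return dr_loss, imputation_loss

# Tiny hand-made batch (all values hypothetical).
click = np.array([1, 0, 1, 0])
conversion = np.array([1, 0, 0, 0])
p_ctr = np.array([0.5, 0.2, 0.25, 0.1])
p_cvr = np.array([0.8, 0.3, 0.4, 0.6])

# With a zero imputed error the loss reduces to the IPW-weighted loss.
dr_zero, _ = multi_dr_losses(p_ctr, p_cvr, np.zeros(4), click, conversion)
ipw = np.sum(click * bce(conversion, p_cvr) / p_ctr) / click.size

# With a perfect imputation model the correction term vanishes.
e_true = bce(conversion, p_cvr)
dr_perfect, imp_perfect = multi_dr_losses(p_ctr, p_cvr, e_true, click, conversion)
print(dr_zero, ipw, dr_perfect, imp_perfect)
```

The two limiting cases show how the imputed error extends the training signal to unclicked pairs, while the propensity-weighted correction is applied only where conversion labels are observed.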

5 Experiments

In this section, we evaluate the performance of the proposed models with a public dataset and a large-scale production dataset collected from Mobile Taobao, the leading e-commerce platform in China. The experiments are intended to answer the following questions:

  • Q1: Do our proposed approaches outperform other state-of-the-art CVR estimation methods?

  • Q2: Are our proposed models more efficient in an industrial setting than other baseline models?

  • Q3: How is the performance of our proposed models affected by hyper-parameters?

5.1 Datasets

We conduct the experiments on the following two datasets:


Alibaba Click and Conversion Prediction (Ali-CCP) dataset is collected from real-world traffic logs of the recommender system in Taobao platform [ma_entire_2018]. The dataset includes 84M data samples that contain 3.4M clicks, and 18K conversions. Ali-CCP is divided approximately 50/50 into training set and testing set. This public dataset contains three categories of features: 1) user features, 2) item features, 3) combination features.

Production sets

This industrial production dataset is collected from the Mobile Taobao e-commerce platform. We collected three weeks of transaction data, from 2019-08-17 to 2019-09-08. There are 11.5 billion data samples, 0.6 billion clicks, and 8.3 million conversions. Our production dataset includes 109 features, which are primarily categorized into: 1) user features, 2) item features, 3) combination features. We further divide this 3-week transaction dataset into 4 subsets:

  • Set A uses data in 2019-08-17 as training set, and data in the next day as testing data set. Set A is approximately 5% of the entire dataset.

  • Set B partitions data from 2019-08-17 to 2019-08-20 as training set, and data in 2019-08-21 as testing set. Set B is approximately 20% of the entire dataset.

  • Set C uses data from 2019-08-17 to 2019-08-27 as training set, and data in 2019-08-28 as testing set. Set C is approximately 50% of the entire dataset.

  • Set D contains all the data from 2019-08-17 to 2019-09-08. Training set includes all data in 3 weeks but the last day (2019-09-08). Testing set includes the data in 2019-09-08.

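| Dataset | # Exposure | # Click | # Conversion | # User | # Item |
|---------|------------|---------|--------------|--------|--------|
| Ali-CCP | 84M        | 3.4M    | 18K          | 0.4M   | 4.3M   |
| Set A   | 1.1B       | 54.5M   | 0.6M         | -      | 22.5M  |
| Set B   | 2.7B       | 0.2B    | 1.9M         | -      | 39.1M  |
| Set C   | 6.0B       | 0.4B    | 4.3M         | -      | 62.6M  |
| Set D   | 11.5B      | 0.6B    | 8.3M         | -      | 81.5M  |

Table 1: Statistics of experimental datasets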

5.2 Baseline models

We compare the Multi-IPW model and the Multi-DR model with the following baselines. Note that some baselines are causal estimators which we modify to predict the unbiased CVR, and the other models are existing non-causal estimators designed for CVR prediction.

5.2.1 Non-causal estimators

  • Base is a naive post-click CVR model, which is a multi-layer perceptron (see the CVR task in Figure 4). Note that this is essentially an independent MLP model which takes the feature embeddings as input and predicts the CVR. The base model is trained in the click space.

  • Oversampling [weiss2004mining] deals with the class imbalance issue by duplicating the minority data samples (conversion=1) in training set with an oversampling rate k. In our experiment, we set k = 5. The oversampling model is trained in the click space.

  • ESMM [ma_entire_2018] utilizes multi-task learning methods and reduces the CVR estimation into two auxiliary tasks, i.e., CTR task and CTCVR task. ESMM is trained in the entire exposure space, and deemed as the state-of-the-art CVR estimation model in real practice.

  • Naive Imputation takes all the unclicked data as negative samples. Hence, it is trained in the entire exposure space.
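The training sets behind the click-space baselines, the Oversampling baseline, and the Naive Imputation baseline can be sketched as simple array operations. The helper names and the use of -1 as a marker for an unobserved conversion label are assumptions for illustration, and the sketch assumes that an oversampling rate k means each positive sample appears k times in total.

```python
import numpy as np

def click_space(features, click, conversion):
    """Keep only clicked pairs, as the Base and Oversampling models do."""
    mask = click == 1
    return features[mask], conversion[mask]

def oversample_positives(features, conversion, k=5):
    """Repeat each converted sample k times and keep every other sample once."""
    repeats = np.where(conversion == 1, k, 1)
    idx = np.repeat(np.arange(len(conversion)), repeats)
    return features[idx], conversion[idx]

def naive_imputation_labels(click, conversion):
    """Treat every unclicked pair as a negative sample over the exposure space."""
    return np.where(click == 1, conversion, 0)

features = np.arange(12).reshape(6, 2)           # six exposed pairs, two toy features
click = np.array([1, 1, 0, 1, 0, 1])
conversion = np.array([0, 1, -1, 0, -1, 1])      # -1 marks an unobserved label

x_click, y_click = click_space(features, click, conversion)
x_over, y_over = oversample_positives(x_click, y_click, k=5)
y_imputed = naive_imputation_labels(click, conversion)

print(len(y_click), len(y_over), y_imputed.tolist())
```

The oversampled set changes the class balance inside the click space only, while naive imputation enlarges the training space at the cost of labeling every unclicked pair as a non-conversion.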

5.2.2 Causal estimators

  • Naive IPW [schnabel_recommendations_2016] is the naive IPW estimator described in Section 3.4. Note that it is not specifically designed for the CVR estimation task, which has its own intrinsic issues. For example, it cannot deal with the data sparsity issue that inherently exists in the CVR task.

  • Joint Learning DR [wang_doubly_2019] is devised to learn from ratings that are missing not at random. In this experiment, we tailor Joint Learning DR for CVR estimation. Similarly, Joint Learning DR handles the data sparsity issue poorly.

  • Heuristic DR is designed as a baseline for Multi-DR. It assumes that the unclicked items are negative samples with probability , where is smoothing rate and it denotes the probability of having a positive label. In the experiments, we explore in and report the best performance.

5.3 Metrics

In the CVR prediction task, the area under the receiver operating characteristic curve (AUC) is a widely used metric [fawcett2006introduction]. One interpretation of AUC in the context of ranking systems is that it denotes the probability of ranking a random positive sample higher than a random negative sample. Meanwhile, we also adopt Group AUC (GAUC) in our assessments [zhou2018deep]. GAUC extends AUC by calculating the weighted average of AUC grouped by page views or users,

$$\text{GAUC} = \frac{\sum_{i} w_i \cdot \text{AUC}_i}{\sum_{i} w_i},$$

where $w_i$ is the number of exposures in group $i$. GAUC is commonly recognized as a more indicative metric in real practice [zhou2018deep]. In this paper, we evaluate the proposed models and baselines with both AUC and GAUC on the production sets. On the public dataset, models are only assessed with AUC, as the dataset is missing the information needed for computing GAUC.

In addition to the CVR estimation task, we also evaluate our predictive models on the CTCVR estimation task, which predicts the conversion probability of an item given that it is exposed to a customer. In our experiments, CTCVR is computed by $pCTCVR = pCTR \times pCVR$. To sum up, we have three different metrics in total: 1) CVR AUC, 2) CTCVR AUC, 3) CTCVR GAUC.

5.4 Experiments setup

5.4.1 Ali-CCP experiment

The experiment setup on Ali-CCP mostly follows the prior work [ma_entire_2018]. We set the dimension of all embedding vectors to 18. All the multi-layer perceptrons (MLP) in the multi-task learning module share the same architecture, . The optimizer is Adam with a learning rate , and the batch size is set to .

5.4.2 Production set experiment

In the production set experiment, we vary the dimensions of the feature embedding vectors according to each feature’s real size in order to minimize memory usage. For a fair comparison, all the models in this experiment share , MLP architecture , and the Adam optimizer with learning rate . We also add normalization to the imputation model in Multi-DR, and the coefficient is .

5.5 Experiment results (Q1)

In this section, we show that our proposed approaches outperform the baseline CVR estimation methods. We conduct experiments on the public dataset (Ali-CCP) with 84 million data samples and on the production dataset with 11.5 billion data samples. The results are summarized in Table 2 and Table 3. The best score in every column is achieved by either Multi-IPW or Multi-DR. We also make the following observations:

  • Multi-IPW and Multi-DR outperform ESMM on every metric of the production dataset, and on the public dataset in all but one case (the CTCVR AUC of Multi-IPW, 65.30 vs. 65.32). We believe that the performance boost comes from the fact that our proposed models account for the data generation process and adjust for the cause of missingness, whereas ESMM is blind to the MNAR mechanism and its prediction is biased.

  • On the production dataset, Multi-IPW and Multi-DR consistently outperform Joint Learning DR [wang_doubly_2019]. We attribute the improvement to the multi-task learning module. Recall that in our proposed estimators, feature embeddings are shared among the CVR task, the CTR task, and the imputation task in the training phase. This mechanism counters the inherent data sparsity issue in the CVR estimation task. Joint Learning DR, in contrast, has no means of addressing this issue, hence the degradation in its prediction performance.

                                               Set A (1.1B)    Set B (2.7B)    Set C (6.0B)    Set D (11.5B)
    Model                                      CVR    CTCVR    CVR    CTCVR    CVR    CTCVR    CVR    CTCVR
    AUC score
    Base                                       78.24  73.12    78.67  73.86    79.62  74.70    81.66  76.28
    Oversampling [weiss2004mining]             78.63  73.53    78.72  74.09    79.69  74.82    81.77  76.30
    ESMM [ma_entire_2018]                      79.29  73.86    79.74  74.33    80.11  74.97    82.17  76.55
    Naive Imputation                           78.12  73.21    78.44  73.50    79.32  73.81    81.56  76.39
    Naive IPW [schnabel_recommendations_2016]  79.23  73.82    79.73  74.34    80.14  74.92    82.13  76.45
    Heuristic DR                               78.45  73.45    78.84  73.99    79.52  74.18    81.74  76.40
    Joint Learning DR [wang_doubly_2019]       79.09  73.67    79.53  74.51    80.01  74.90    82.09  76.61
    Multi-IPW                                  79.51  73.99    79.85  74.81    80.21  75.01    82.57  76.89
    Multi-DR                                   79.72  74.45    79.80  74.91    80.50  75.39    82.72  77.23
    GAUC score
    Base                                       -      59.69    -      60.16    -      60.58    -      61.27
    Oversampling [weiss2004mining]             -      60.17    -      60.28    -      60.59    -      61.30
    ESMM [ma_entire_2018]                      -      60.53    -      60.90    -      61.13    -      61.76
    Naive Imputation                           -      60.14    -      60.39    -      60.56    -      61.39
    Naive IPW [schnabel_recommendations_2016]  -      60.51    -      60.95    -      61.09    -      61.77
    Heuristic DR                               -      60.01    -      60.30    -      60.65    -      61.35
    Joint Learning DR [wang_doubly_2019]       -      60.43    -      60.83    -      60.97    -      61.67
    Multi-IPW                                  -      60.70    -      61.09    -      61.25    -      61.98
    Multi-DR                                   -      60.90    -      60.99    -      61.52    -      62.28
    Table 2: Results of comparison study on Production datasets. The best scores are bold-faced in each column. Note that this table has two sections, AUC scores and GAUC scores. The rows that contain the models proposed in this paper are highlighted in color grey.

    Model                             CVR AUC         CTCVR AUC
    Base                              66.00 ± 0.37    62.07 ± 0.45
    Oversampling [weiss2004mining]    67.18 ± 0.32    63.05 ± 0.48
    ESMM-NS [ma_entire_2018]          68.25 ± 0.44    64.44 ± 0.62
    ESMM [ma_entire_2018]             68.56 ± 0.37    65.32 ± 0.49
    Multi-IPW                         69.21 ± 0.42    65.30 ± 0.50
    Multi-DR                          69.29 ± 0.31    65.43 ± 0.34
    Table 3: Results of comparison study on Public dataset: Ali-CCP. Experiments are repeated 10 times and mean ± 1 std of AUC scores are reported. The best scores are bold-faced in each column. The rows that contain the models proposed in this paper are highlighted in color grey.

  • We observe that the Multi-IPW estimator is superior to the Naive IPW estimator in all experiments. Note that there is a trade-off between an independent, potentially better-trained propensity model and the merit of multi-task learning, i.e., information sharing between modules. We observe that by sacrificing the accuracy of the propensity model to some extent, we can achieve better performance from a co-trained prediction model.

  • We notice that Multi-DR has better performance than Multi-IPW in most cases. Recall that Multi-DR augments Multi-IPW by introducing an imputation model. Provided that , the tail bound of Multi-DR is proven to be lower than that of Multi-IPW for any learned propensity score [wang_doubly_2019]. Therefore, Multi-DR is expected to perform better than Multi-IPW when the imputation model is well-trained. Note that Multi-DR is more complex in design than Multi-IPW. We need to pay the price of adding complexity in model implementation for the performance boost. In real practice, we also need to account for the difficulty of model development. Therefore, Multi-IPW is designed for the scenario where quick and cheap model deployment is demanded, whereas Multi-DR is preferable when we have enough time to search for a good imputation model and sampling rate.

The experiment results demonstrate that Multi-IPW and Multi-DR counter the selection bias and data sparsity issues in CVR estimation in a principled and highly effective way. In the next subsection, we will discuss other strengths of the proposed methods.

5.6 Computational efficiency (Q2)

Figure 5: Computational cost of Multi-IPW and Multi-DR. The left subplot shows the hours needed to complete one epoch of training. The middle subplot shows the size of the embedding parameters of each model. The right subplot shows the size of the hidden layer parameters of each model. Note that the proposed models achieve the best prediction performance while having the lowest computational cost.

In this part, we study the computational efficiency of the proposed CVR estimators against the baselines in an industrial setting. We summarize the training time and parameter space size of each model in Figure 5. The results indicate that the proposed methods are cost-effective. This comparison study is conducted on a distributed cluster with the configuration summarized in Table 4.

We observe that Joint Learning DR and Naive IPW demand over 30 hours of training for one epoch on the whole production dataset. Recall that Naive IPW needs to train an independent propensity estimator before fitting the CVR predictor. This doubles its training time relative to the multi-task learning CVR estimators (e.g., ESMM, Multi-DR, and Multi-IPW), which train the CTR task, the CVR task, and the imputation task at the same time. As for Joint Learning DR, it trains the imputation model and the prediction model alternately. Hence, the training time it needs is at least double that of the other estimators.

In addition, we notice that Multi-DR and Multi-IPW are also storage efficient compared with Joint Learning DR and Naive IPW. Since the embedding parameters are shared among the CTR task, the CVR task, and the imputation task in the proposed models, we only need to save one copy of those trainable parameters in the parameter servers. Note that Multi-DR has more hidden layer parameters than ESMM and Naive IPW, due to the additional imputation model. However, the memory usage of the parameters in the hidden layers is generally negligible compared with the huge storage consumption of the embedding layers. In a nutshell, the proposed models achieve the best performance with the lowest computational cost.

Cluster configuration    Parameter Server    Worker
# instances              4                   100
# CPU                    28 cores            440 cores
# GPU¹                   -                   25 cards
MEMORY (GB)              40                  1000
¹ GPU specs: Tesla P100-PCIE-16G
Table 4: Distributed cluster configuration

5.7 Hyper-parameters in model implementation

IPW bound is a hyper-parameter introduced in our model implementation to handle the high variance of propensities. It clamps the propensities whose values are greater than the predefined threshold. A plausible IPW bound value is typically confined by . IPW bound percentage can be calculated as .

In Multi-DR, the imputation model introduces the unclicked items to the training set. Empirically, most of the unclicked items would not be purchased by customers even if they were clicked. Including these unclicked items in the training set therefore skews the data distribution and worsens the class imbalance issue. Hence, instead of adding all the unclicked samples, we under-sample them with a sampling rate . For example, if the number of clicked samples () is 100 and the batch size is 1000, means that after under-sampling the samples we used to train Multi-DR is . Note that without under-sampling, Multi-DR takes all samples in the batch as training samples.
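The two hyper-parameters can be illustrated with a small NumPy sketch. It rests on assumptions that the text does not pin down: the IPW bound is interpreted here as a cap on the inverse propensity weight 1/p (a common way to limit variance), and the sampling rate is interpreted as the probability of keeping each unclicked sample. The function names, the per-sample loss array, and the toy numbers are hypothetical.

```python
import numpy as np

def clipped_ipw_loss(click, sample_loss, propensity, ipw_bound):
    """IPW-weighted batch loss over clicked samples.
    Assumption: the bound caps the inverse propensity weight 1/p."""
    click = np.asarray(click, dtype=float)
    sample_loss = np.asarray(sample_loss, dtype=float)
    propensity = np.asarray(propensity, dtype=float)
    weights = np.minimum(1.0 / propensity, ipw_bound)
    # Unclicked samples (click == 0) contribute nothing to the IPW loss.
    return float(np.sum(click * sample_loss * weights) / len(click))

def undersample_unclicked(click, sampling_rate, rng):
    """Return indices of samples kept for training.
    Assumption: every clicked sample is kept, and each unclicked sample
    is kept independently with probability `sampling_rate`."""
    click = np.asarray(click).astype(bool)
    keep = click | (rng.random(len(click)) < sampling_rate)
    return np.flatnonzero(keep)

# Toy batch (hypothetical values).
click = np.array([1, 1, 0])
sample_loss = np.array([1.0, 2.0, 5.0])
propensity = np.array([0.5, 0.01, 0.2])

# Without the cap, the second sample would receive a weight of 100.
print(clipped_ipw_loss(click, sample_loss, propensity, ipw_bound=10.0))

rng = np.random.default_rng(0)
print(undersample_unclicked(click, sampling_rate=0.5, rng=rng))
```

With a sampling rate of 0 only the clicked samples remain, and with a rate of 1 the whole batch is used, which matches the statement that Multi-DR without under-sampling trains on all samples in the batch.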

5.8 Empirical study on hyper-Parameter sensitivity (Q3)

Figure 6: Results of parameter sensitivity experiments.

In this section, we investigate how Multi-IPW and Multi-DR are affected by two important hyper-parameters in our model implementation, IPW bound and sampling rate . We evaluate the performance of Multi-IPW with varying IPW bound . We observe that in Figure 6, when IPW bound , prediction performance eventually improves as IPW bound increases. We can clearly see the performance drop of CVR AUC if the threshold is greater than . We reason that larger IPW bound allows undesired higher variability of propensity scores, which may lead to sub-optimal prediction performance.

We evaluate the performance of Multi-DR with varying sampling rate . We observe that produces the best prediction, and the model performance starts decreasing when . We argue that, as the sampling rate increases, more unclicked samples are included in the training set, which inevitably worsens the class imbalance issue and typically causes predictive models to generalize poorly. Conversely, introducing a small number of unclicked samples from the imputation model can boost our CVR prediction (see figure (d) when ).

6 Conclusion and future works

In this paper, we proposed the Multi-IPW and Multi-DR CVR estimators for industrial recommender systems. Both CVR estimators aim to counter two inherent issues that exist in real practice: 1) selection bias, 2) data sparsity. Extensive experiments with billions of data samples demonstrate that our methods outperform the state-of-the-art CVR predictive models, and handle the CVR estimation task in a principled, highly effective and efficient way. Our methods exploit the sequential events in e-commerce, "exposure → click → conversion", by adopting the multi-task learning approach to boost model performance and reduce computational cost. Although our methods are devised for CVR estimation, the idea can be generalized to other prediction tasks in recommender systems. For example, in the CTR estimation task, the recommender system typically chooses which items will be exposed to customers. The dataset for CTR estimation in this scenario might also suffer from data missing not at random. The sequential pattern in the CTR prediction task can be summarized as "item pool → exposure → click". Intuitively, our methods can be modified for this task. Future work may explore the possibilities of applying our models to various prediction tasks in industrial recommender systems.