1 Introduction
Selection bias is a widely recognized issue in recommender systems [schnabel_recommendations_2016, ma_entire_2018, de2014reducing]. For example, music streaming services usually suggest genres that have positive user feedback (e.g., favorite, share, and purchase), and selectively ignore the ones that are rarely exposed to users [van2013deep]. In this paper, we study the selection bias that exists in post-click conversion rate (CVR) estimation.
Problem formulation
Post-click conversion rate (CVR) estimation is an essential task in e-commerce recommender systems. Generally, conversion rates are used to compute item ranking scores, and we prioritize and recommend items with high ranking scores to our customers. A typical e-commerce transaction follows the sequential path "exposure → click → conversion". The post-click conversion rate is the probability of the transition from click to conversion. Typically, when training CVR models, we only consider the items that customers clicked on, as we are unaware of the conversion feedback of the items that are not clicked [huang2007correcting]. Therefore, the unclicked data are excluded from the training dataset. Among the clicked items, the ones which customers purchase are positive samples, whereas the others are negative samples. Bear in mind that not clicking on an item does not necessarily indicate that the customer is not interested in purchasing it; customers may unconsciously skip certain items that might be interesting to them. Figure 1 shows that the exposure space is a superset of the click space. Selection bias comes from the fact that conventional CVR models are trained in the click space, whereas the predictions are made within the entire exposure space (Figure 1) [ma_entire_2018]. Intuitively, data in the click space is drawn from the entire exposure space and is biased by user self-selection (e.g., the interests of users). Therefore, the data distribution in the click space is systematically different from the one in the exposure space. This inherent discrepancy leads to data that is Missing Not At Random (MNAR), and to selection bias in conventional CVR models [de2014reducing, little2019statistical, little2002statistical]. We identify two practical issues that make CVR estimation quite challenging in industrial-level recommender systems:

Selection bias: The systematic difference in data distribution between the training space (i.e., all user self-selected items) and the inference space (i.e., all exposed items) biases conventional CVR models [steck2010training, huang2007correcting, ai2018unbiased]. This bias usually degrades model performance.

Data sparsity: In the CVR estimation task, data sparsity refers to the fact that item clicks are rare events (we have a CTR of 5.2% in the production dataset and 4% in the public dataset). Conventional CVR models are typically trained only using data in the click space; therefore, the number of training samples may not be sufficient for the large parameter space. In our experiments, the numbers are 0.6 billion samples vs. 5.3 billion parameters in the production dataset, and 4.3 million samples vs. 2.6 billion parameters in the public dataset (see Section 5.1) [lee2012estimating, wang2018billion].
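The effect of user self-selection on the two spaces can be illustrated with a small simulation. This is a hedged sketch on purely synthetic data: the latent `interest` variable and all probabilities below are assumptions for illustration, not estimates from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                                 # exposed user-item pairs
interest = rng.uniform(0, 1, n)             # latent user interest (confounder)
# The same latent interest drives both clicking and converting.
click = rng.uniform(0, 1, n) < 0.1 + 0.6 * interest
convert = rng.uniform(0, 1, n) < 0.5 * interest   # potential conversion if clicked

cvr_click_space = convert[click].mean()     # what a naive CVR model is fit on
cvr_exposure_space = convert.mean()         # the space we actually predict over

print(f"CVR in click space:    {cvr_click_space:.3f}")
print(f"CVR in exposure space: {cvr_exposure_space:.3f}")
```

Because clicks concentrate on high-interest pairs, the click-space conversion rate systematically overestimates the exposure-space rate; this is the MNAR pattern that biases conventional CVR models.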
To address the critical issues of selection bias and data sparsity in CVR estimation, we approach the problem from a causal perspective and develop causal methods within a multi-task learning framework. We propose two principled, efficient, and highly effective CVR estimators: the Multi-task Inverse Propensity Weighting estimator (Multi-IPW) and the Multi-task Doubly Robust estimator (Multi-DR). Our methods are designed for unbiased CVR estimation and also account for the data sparsity issue.
The main contributions of this paper are summarized as follows:

To the best of our knowledge, this is the first paper that aims to solve the selection bias issue in conversion rate estimation from a causal perspective. Different from existing works, our methods adjust for the MNAR mechanism of the data and deal with the selection bias of CVR estimation in a principled way. Meanwhile, we give mathematical proofs that the proposed methods are theoretically unbiased.

To the best of our knowledge, this is the first paper that combines causal inference methods with multi-task learning. We argue that such a combination makes the proposed estimators both effective and efficient in an industrial setting. Specifically, by introducing multi-task learning, our methods can effectively address the data sparsity issue, train faster, and save half of the memory compared with other state-of-the-art causal approaches that we modify for the CVR estimation task.

We propose two principled, efficient, and highly effective models: the Multi-task Inverse Propensity Weighting estimator (Multi-IPW) and the Multi-task Doubly Robust estimator (Multi-DR). Multi-IPW integrates the inverse propensity-score weighting technique with multi-task learning, aiming to solve both the selection bias and data sparsity issues. Multi-DR augments Multi-IPW with a doubly robust mechanism; it enjoys the merits of Multi-IPW as well as an expanded training set.

We conduct extensive experiments on an industrial-level production dataset, i.e., the Taobao dataset (11.5 billion samples), and a public dataset, i.e., the AliCCP dataset (84 million samples), to compare Multi-IPW and Multi-DR against several state-of-the-art CVR prediction models. Experiment results show that our models outperform ESMM [ma_entire_2018] and state-of-the-art causal models, and demonstrate the efficiency of our methods in a real industrial setting.
2 Related works
In this section, we review several existing works that attempt to tackle the selection bias issue in recommender systems, and summarize how our methods differ from prior works.
Ma et al. [ma_entire_2018] proposed the Entire Space Multi-task Model (ESMM) to remedy the selection bias and data sparsity issues in conversion rate (CVR) estimation. ESMM is trained in the entire exposure space, and it reduces the CVR task to two auxiliary tasks, i.e., click-through rate (CTR) and click-through&conversion rate (CTCVR) estimation. However, we argue that ESMM is not an unbiased estimator, and hence it can only address the selection bias issue to some extent. The details of our argument are presented in Section 3.2.
Causal inference provides us with a way to account for the data generation process when we attempt to restore the information in data that is MNAR [little2002statistical]. Schnabel et al. [schnabel_recommendations_2016] proposed a propensity-score-weighting-based estimator in an empirical risk minimization framework for learning and evaluating recommender systems from biased data. Propensity-based models may still be biased if the propensities are not accurately estimated (mathematical proofs are given in Section 3.4). Wang et al. [wang_doubly_2019] proposed a doubly robust joint learning approach for estimating item ratings that are MNAR. The Doubly Robust (DR) estimator combines propensity-based methods with an imputation model that estimates the prediction error for the missing data. When the propensities are not accurately learned, the DR estimator can still enjoy unbiasedness as long as its imputation model is accurate. However, DR is not devised for CVR estimation, so it fails to account for the severe data sparsity issue that widely exists in CVR estimation tasks. In addition, the joint learning approach tends not to scale to large training datasets. We will further discuss the joint learning DR in Section 3.5.
To summarize, our approach differs from the aforementioned methods in three aspects: 1) The problems are different. We develop our methods for CVR estimation in e-commerce systems, while they focus on rating prediction [lu2018coevolutionary]. 2) The challenges are different. We design our models to address both the selection bias and data sparsity issues, while they only consider the former (ESMM considers both). 3) The methods are different. We integrate a multi-task framework with causal approaches. Specifically, we co-train the propensity model, imputation model, and prediction model simultaneously with deep neural networks, while they train these modules separately or alternately, usually with models such as linear regression or matrix factorization [gopalan2014content, guo2017deepfm, rendle2010factorization, he2016fast]. We will further justify our design in Section 4 and report the performance improvements in Section 5.
3 Unbiased estimators for CVR prediction
In this section, we first present the mathematical formulation of estimation bias, as well as the notation used in this paper. Then we demonstrate why ESMM and its variant are incapable of eliminating selection bias in CVR estimation. Next, we formulate the CVR estimation task as a causal inference problem and discuss how causal intervention can remove the selection bias. Finally, we present two existing unbiased estimators in causal inference (i.e., the Inverse Propensity Weighting (IPW) estimator and the Doubly Robust (DR) estimator), and pinpoint the drawbacks of these models.
3.1 Preliminary
Let $\mathcal{U}$ be a set of users and $\mathcal{I}$ be a set of items. $\mathcal{D} = \mathcal{U} \times \mathcal{I}$ denotes the set of all user-item pairs. Let $Y$ be the true conversion label matrix, where each entry $y_{u,i} \in \{0, 1\}$. A CVR estimator predicts the probability of conversion for each user-item pair in $\mathcal{D}$. Let $\hat{Y}$ be the predicted conversion score matrix, where each entry $\hat{y}_{u,i} \in [0, 1]$. Therefore, the prediction inaccuracy $\mathcal{P}$ over all user-item pairs can be formulated as follows,

$$\mathcal{P} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} e_{u,i}, \qquad (1)$$

where $e_{u,i}$ is the cross-entropy loss between $y_{u,i}$ and $\hat{y}_{u,i}$, i.e., $e_{u,i} = -y_{u,i} \log \hat{y}_{u,i} - (1 - y_{u,i}) \log (1 - \hat{y}_{u,i})$. Let $O$ be the indicator matrix where each entry $o_{u,i}$ is an observation indicator: $o_{u,i} = 1$ if user $u$ clicks on item $i$; otherwise, $o_{u,i} = 0$. Naive CVR prediction models are trained only using the observed data $\mathcal{O} = \{(u,i) \mid (u,i) \in \mathcal{D},\ o_{u,i} = 1\}$. Let $Y_{\mathrm{obs}}$ and $Y_{\mathrm{mis}}$ be the sets of conversion labels that are present and absent in our data, respectively. We evaluate these naive CVR models by averaging the cross-entropy loss over the observed data [schnabel_recommendations_2016, wang_doubly_2019],

$$\mathcal{E}_{\mathrm{naive}} = \frac{1}{|\mathcal{O}|} \sum_{(u,i) \in \mathcal{O}} e_{u,i}, \qquad (2)$$

where $|\mathcal{O}| = \sum_{(u,i) \in \mathcal{D}} o_{u,i}$. Let $P(O \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$ be the probability distribution that governs the missingness variable $O$. The missingness of conversion feedback depends on the CTR; this user self-selection process therefore generates data that is MNAR [rubin1976inference, enders2010applied]. We say a CVR estimator $\mathcal{E}$ is unbiased when the expectation of the estimated prediction inaccuracy over the training dataset equals the prediction inaccuracy $\mathcal{P}$, i.e., $\mathbb{E}_O[\mathcal{E}] = \mathcal{P}$; otherwise it is biased. If data is missing completely at random (MCAR), the naive estimation is unbiased: the observed data can be deemed a sample drawn randomly from the population, so the expectation of the estimated prediction inaccuracy in this sample equals its value in the population [kallenberg2006foundations, liang2016modeling]. However, if data is MNAR, $\mathbb{E}_O[\mathcal{E}_{\mathrm{naive}}] \neq \mathcal{P}$.
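The claim that the naive estimator is unbiased under MCAR but biased under MNAR can be checked numerically. The following is a minimal sketch on synthetic data; the label rate, model scores, and observation probabilities are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200_000
y = rng.uniform(0, 1, n) < 0.3                            # true conversion labels
y_hat = np.clip(0.3 + rng.normal(0, 0.1, n), 0.05, 0.95)  # a fixed model's scores

def ce(y, p):
    """Per-entry cross-entropy loss."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

loss = ce(y, y_hat)
true_inaccuracy = loss.mean()        # the ideal average over all user-item pairs

# MCAR: observation is independent of the label.
o_mcar = rng.uniform(0, 1, n) < 0.1
# MNAR: positives are far more likely to be observed (self-selection).
o_mnar = rng.uniform(0, 1, n) < np.where(y, 0.3, 0.05)

print(abs(loss[o_mcar].mean() - true_inaccuracy))   # close to zero
print(abs(loss[o_mnar].mean() - true_inaccuracy))   # clearly bounded away from zero
```

Under MCAR the naive average over observed entries matches the full-space average up to sampling noise; under MNAR it does not.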
3.2 Is ESMM an unbiased CVR estimator?
In this section, we demonstrate that ESMM, the state-of-the-art CVR estimator in practice, is essentially biased, even though the authors claim in their paper that the model eliminates selection bias [ma_entire_2018]. We first formulate the estimation bias of ESMM and prove that it is not theoretically unbiased by giving a counterexample. We then intuitively pinpoint the weakness of ESMM, and provide a more in-depth explanation from a causal perspective in the next section.
Let $\mathcal{E}_{\mathrm{CTR}}$, $\mathcal{E}_{\mathrm{CVR}}$, and $\mathcal{E}_{\mathrm{CTCVR}}$ be the cross-entropy losses of the CTR, CVR, and CTCVR tasks. Then we have,

$$\mathcal{E}_{\mathrm{ESMM}} = \mathcal{E}_{\mathrm{CTR}} + \mathcal{E}_{\mathrm{CTCVR}}. \qquad (3)$$

We can easily verify that $\mathbb{E}_O[\mathcal{E}_{\mathrm{ESMM}}] \neq \mathcal{P}$ using the counterexample in Figure 2. Note that to be theoretically unbiased, ESMM should satisfy $\mathbb{E}_O[\mathcal{E}_{\mathrm{ESMM}}] = \mathcal{P}$. Therefore, we conclude that ESMM cannot ensure unbiased CVR estimation, and it is a biased CVR estimator.
ESMM proposes a workaround to estimate the CVR prediction inaccuracy in the entire exposure space, in contrast to the conventional estimators that work in the click space. However, the true conversion feedback in the exposure space is still missing not at random. Even if ESMM can accurately predict CTR and CTCVR in the exposure space, the unbiasedness of ESMM still depends on whether the true CVR can be predicted accurately for the unclicked data. Since ESMM is not trained with the correct conversion labels of these unclicked data, it is unlikely to predict their CVR accurately, and hence it tends to be biased. In the next section, we provide a more in-depth discussion of this weakness from a causal perspective.
3.3 A causal perspective to unbiased CVR estimation
Recall that selection bias in CVR estimation comes from the fact that models are trained over the click space, whereas the predictions are made over the exposure space (see Figure 1). Ideally, we could remove the selection bias by building our CVR estimators on a dataset where the conversion labels of all the items are known. In the language of causal inference, this is equivalent to training CVR estimators on a "do dataset", where a causal intervention is applied to the click event during the data generation process. Specifically, users are "forced" to click on every item in the exposure space and then make their purchase decisions. Note that the training space is the same as the inference space in the "do dataset"; hence, the selection bias is eliminated. Intuitively, we can also understand how causal intervention removes the bias from Figure 3, where the self-selection factors affect both click events and conversion events. For example, such factors can be the purchase interest or price discounts that customers consider in online shopping. In causal inference, we refer to these factors as "confounder(s)" that bias the CVR inference [wang2018deconfounded]. Once the causal intervention is applied to the click event (i.e., users are forced to click on all exposed items), the self-selection factors have no control over user click behaviors. This means we have successfully removed the confounder which biases our CVR estimation [pearl2009causal, pearl2018book, wang2018deconfounded, pearl2000causality, morgan2015counterfactuals].
Obviously, the "do dataset" generated in this imaginary intervention experiment cannot be obtained in reality. The challenge is then how to train our CVR estimators on the observed dataset as if we were training on the "do dataset". In the following sections, we discuss two estimators that can achieve unbiased CVR prediction with data that is MNAR.
As for ESMM and its variants, they are directly trained over the entire exposure space, which is different from the "do dataset". In the "do dataset", we apply a causal intervention to the click events during the data generation process to obtain the true conversion label matrix (see matrix e in Figure 2). Without such intervention, the conversion labels in the dataset are still missing not at random. For example, in Figure 2, matrix f is obtained without causal intervention. The conversion label of the last entry is observed as 0 because its click label is 0, whereas the true conversion label is 1 in matrix e. We can also understand the weakness of ESMM and its variants using the causal graph (see Figure 3): CVR estimators cannot achieve unbiased CVR estimation as long as they fail to remove the confounder.
3.4 Inverse propensity weighting CVR estimator
Recall that the missingness in the observed data is governed by the probability distribution $P(O \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$. Equivalently, this probability distribution determines the data generation process. The marginal probability $p_{u,i} = P(o_{u,i} = 1)$ of observing an entry in $\mathcal{D}$ is referred to as the propensity score. In practice, the real propensity $p_{u,i}$ cannot be obtained directly. Instead, we estimate it with a learned propensity $\hat{p}_{u,i}$. The inverse propensity weighting (IPW) CVR estimator uses $\hat{p}_{u,i}$ to inversely weight the prediction loss [little2002statistical, imbens2015causal, schnabel_recommendations_2016, hirano2003efficient],

$$\mathcal{E}_{\mathrm{IPW}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}\, e_{u,i}}{\hat{p}_{u,i}}. \qquad (4)$$

Since the IPW CVR estimator adjusts for the data generation process, its CVR estimation is theoretically unbiased if the propensity estimation is accurate (i.e., $\hat{p}_{u,i} = p_{u,i}$),

$$\mathbb{E}_O\!\left[\mathcal{E}_{\mathrm{IPW}}\right] = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{\mathbb{E}_O[o_{u,i}]\, e_{u,i}}{p_{u,i}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} e_{u,i} = \mathcal{P}. \qquad (5)$$
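The unbiasedness of the IPW estimator in Equation (4) can be verified numerically when the true propensities are known. Below is a hedged sketch on synthetic data (all rates are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 500_000
y = (rng.uniform(0, 1, n) < 0.3).astype(float)   # true conversion labels
y_hat = np.full(n, 0.25)                         # an arbitrary fixed CVR prediction

def ce(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

loss = ce(y, y_hat)
true_inaccuracy = loss.mean()

# MNAR observation: positives are clicked far more often than negatives.
p = np.where(y == 1, 0.4, 0.1)                   # true propensities
o = rng.uniform(0, 1, n) < p

naive = loss[o].mean()                           # biased under MNAR
ipw = (o * loss / p).sum() / n                   # Equation (4) with true propensities

print(abs(naive - true_inaccuracy))              # large
print(abs(ipw - true_inaccuracy))                # near zero
```

Each observed loss is up-weighted by the inverse of its propensity, so rarely observed entries count more, and the weighted average recovers the exposure-space prediction inaccuracy in expectation.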
Typically, the learned propensity score is estimated with an independent logistic regression model [austin2011introduction]. In Section 4, we propose the Multi-IPW model, which leverages the multi-task framework to learn the propensity score (i.e., the CTR in Multi-IPW) simultaneously with the CVR. The naive IPW estimator (Equation (4)) may suffer from high variance [gilotte2018offline, schnabel_recommendations_2016, wang_doubly_2019]. Swaminathan et al. proposed a self-normalized inverse propensity scoring (SNIPS) estimator to reduce this variability [swaminathan2015self]. The Multi-IPW model also incorporates this technique.
3.5 Doubly robust CVR estimator
The inverse propensity weighting model is unbiased contingent on the assumption that we can accurately estimate the real propensity (i.e., $\hat{p}_{u,i} = p_{u,i}$). In real practice, this assumption is too restrictive. To relax this constraint, the doubly robust estimator was introduced in previous works [dudik2011doubly, wang_doubly_2019, vermeulen2015bias, farajtabar2018more].
Wang et al. [wang_doubly_2019] proposed a joint learning approach for training a doubly robust estimator. The joint learning approach alternately trains two models: 1) a prediction model; and 2) an imputation model. The prediction model, parameterized by $\theta$, is designed to predict the conversion rate $\hat{y}_{u,i}$, and its performance is evaluated by the prediction error $e_{u,i}$. The imputation model, parameterized by $\phi$, aims to estimate the prediction error with $\hat{e}_{u,i}$, and its performance is assessed by the imputation error $e_{u,i} - \hat{e}_{u,i}$. The feature vector $x_{u,i}$ encodes all the information about user $u$ and item $i$. Then, we can formulate the loss of the doubly robust CVR estimator as,

$$\mathcal{E}_{\mathrm{DR}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left( \hat{e}_{u,i} + \frac{o_{u,i} \left( e_{u,i} - \hat{e}_{u,i} \right)}{\hat{p}_{u,i}} \right). \qquad (6)$$

Recall that the bias of a prediction model is quantified by $|\mathbb{E}_O[\mathcal{E}] - \mathcal{P}|$. Hence, we can mathematically derive the bias of the doubly robust CVR estimator,

$$\mathrm{Bias}\!\left(\mathcal{E}_{\mathrm{DR}}\right) = \frac{1}{|\mathcal{D}|} \left| \sum_{(u,i) \in \mathcal{D}} \Delta_{u,i}\, \delta_{u,i} \right|, \qquad (7)$$

where $\Delta_{u,i} = \frac{\hat{p}_{u,i} - p_{u,i}}{\hat{p}_{u,i}}$ and $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$. Note that the doubly robust estimator is unbiased if either the true propensity scores or the true prediction errors are accurately estimated. This is a clear advantage over the naive IPW model, whose estimation is unbiased only if we can accurately learn the true propensities.
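The double robustness property can also be checked numerically: with deliberately wrong propensities but accurate imputed errors, the DR estimate still matches the true inaccuracy. A hedged sketch on synthetic data (all quantities assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 500_000
y = (rng.uniform(0, 1, n) < 0.3).astype(float)
y_hat = np.full(n, 0.25)

def ce(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

e = ce(y, y_hat)                     # true prediction errors
true_inaccuracy = e.mean()

p = np.where(y == 1, 0.4, 0.1)       # true propensities (MNAR)
o = rng.uniform(0, 1, n) < p

p_bad = np.full(n, 0.2)              # deliberately wrong learned propensities
e_hat = e.copy()                     # perfectly imputed errors (best case)

# DR: imputed error everywhere, plus a propensity-weighted correction
# on the observed entries.
dr = (e_hat + o * (e - e_hat) / p_bad).mean()
ipw_bad = (o * e / p_bad).mean()     # plain IPW with the same bad propensities

print(abs(ipw_bad - true_inaccuracy))   # biased
print(abs(dr - true_inaccuracy))        # ~0: accurate imputation rescues it
```

Symmetrically, with accurate propensities the correction term repairs an inaccurate imputation model, which is exactly the "either/or" condition stated above.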
In real-world practice, we argue that the joint learning approach may demand a large amount of computational and storage resources. On one hand, we need to train a separate propensity model, which takes almost the same time and memory as the CVR task itself. On the other hand, we need to alternate between training the prediction model and the imputation model. Such an alternating process greatly increases the training time, and as the size of the dataset grows, the training time can quickly become unmanageable in industrial production.
In the next section, we discuss the Multi-DR model, which aims to efficiently solve the selection bias and data sparsity issues in the CVR estimation task in an industrial setting.
4 Causal CVR estimators with multi-task learning
We present the architectures of the Multi-Inverse Propensity Weighting (Multi-IPW) estimator and the Multi-Doubly Robust (Multi-DR) estimator in Figure 4. Both proposed estimators are built on a multi-task learning module, which mitigates the data sparsity issue and boosts training efficiency at industrial scale.
4.1 Multi-task learning module
As the building block of our causal CVR estimators, the multi-task learning module exploits the typical sequence of events on an e-commerce site, i.e., "exposure → click → conversion", and chains the main CVR task with an auxiliary CTR task. In industrial-level CVR modeling, the data sparsity issue refers to the fact that the size of the CVR training dataset is limited by the rare click events, so the large parameter space may not be well-trained. To address this issue, we adopt the philosophy of multi-task learning and introduce an auxiliary CTR task [ni2018perceive, liang2016modeling, gao2019neural, ma_entire_2018, hadash2018rank, zhou2018deep, wang2019multi]. The amount of training data in the CTR task is generally larger than that in the CVR task by orders of magnitude, so the CTR task trains the large parameter space more sufficiently. The feature representation learned in the CTR task is shared with the CVR task, so the CVR model can benefit from the extra information through parameter sharing. The data sparsity issue is therefore remedied by this multi-task learning mechanism. Meanwhile, multi-task learning is also well known for being cost-effective in model development. Typically, multi-task learning co-trains multiple tasks simultaneously as if they were one task. This mechanism can reduce the storage space otherwise spent on duplicate copies of embedding representations. In addition, the parallel training mechanism generally reduces the training time substantially. The proposed Multi-IPW and Multi-DR inherit the aforementioned merits by incorporating the multi-task learning module into causal approaches.
4.2 Multi-inverse propensity weighting estimator
Figure 4 shows that the Multi-IPW model is built upon the CVR task and the CTR task. Conventional IPW estimators normally require an independent model (e.g., logistic regression) to predict the true propensity scores [schnabel_recommendations_2016]. The Multi-IPW estimator, however, can exploit the multi-task learning architecture and use the predicted CTR scores generated by the CTR task as propensity scores. Since the CTR task and the CVR task share embedding parameters, we can expect better CVR prediction performance due to the extra information provided by the CTR task. In addition, multi-task learning reduces the training time of Multi-IPW by half. These are clear advantages over conventional IPW estimators.
We train MultiIPW by minimizing the following training loss,
$$\mathcal{L}_{\text{Multi-IPW}}(\Phi, \theta_{\mathrm{CTR}}, \theta_{\mathrm{CVR}}) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}\, \delta\big(y_{u,i}, \hat{y}_{u,i}(\Phi, \theta_{\mathrm{CVR}})\big)}{\hat{p}_{u,i}(\Phi, \theta_{\mathrm{CTR}})}, \qquad (8)$$

where $\Phi$ represents the shared embedding parameters, and $\theta_{\mathrm{CVR}}$ and $\theta_{\mathrm{CTR}}$ are the neural network parameters of the CVR task and the CTR task, respectively. $\delta(\cdot, \cdot)$, parameterized by $\Phi$ and $\theta_{\mathrm{CVR}}$, is the cross-entropy loss between the true CVR label $y_{u,i}$ and the predicted CVR score $\hat{y}_{u,i}$. We use the predicted CTR score $\hat{p}_{u,i}$, parameterized by $\Phi$ and $\theta_{\mathrm{CTR}}$, as the propensity. $\mathcal{D}$ denotes all data in the exposure space.
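One plausible way to compute this training loss, sketched in plain NumPy: the CTR head is fit with an ordinary cross-entropy on click labels, and its predictions double as the learned propensities for the IPW-weighted CVR loss. Function and variable names here are illustrative, not the authors' implementation.

```python
import numpy as np

def multi_ipw_loss(o, y, ctr_hat, cvr_hat, eps=1e-6):
    """o: click indicators; y: conversion labels; both over the exposure space."""
    ctr_hat = np.clip(ctr_hat, eps, 1 - eps)
    cvr_hat = np.clip(cvr_hat, eps, 1 - eps)
    # Auxiliary CTR task: ordinary cross-entropy on click labels.
    ctr_loss = -(o * np.log(ctr_hat) + (1 - o) * np.log(1 - ctr_hat)).mean()
    # Main CVR task: cross-entropy on clicked entries, inversely weighted
    # by the predicted CTR (the learned propensity).
    cvr_ce = -(y * np.log(cvr_hat) + (1 - y) * np.log(1 - cvr_hat))
    ipw_cvr_loss = (o * cvr_ce / ctr_hat).mean()
    return ctr_loss + ipw_cvr_loss

o = np.array([1., 1., 0., 0.])           # clicks
y = np.array([1., 0., 0., 0.])           # conversions (observed only if clicked)
loss = multi_ipw_loss(o, y,
                      ctr_hat=np.array([0.6, 0.5, 0.4, 0.3]),
                      cvr_hat=np.array([0.7, 0.2, 0.1, 0.1]))
print(loss)
```

In a real multi-task model, both heads would share embedding parameters and be updated jointly by backpropagating this combined loss.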
Theorem 1
Given the true propensities $p_{u,i}$ and the true conversion labels $y_{u,i}$, the Multi-IPW CVR estimator gives unbiased CVR prediction when the estimated CTR scores are accurate, i.e., $\hat{p}_{u,i} = p_{u,i}$:

$$\mathbb{E}_O\!\left[\mathcal{E}_{\text{Multi-IPW}}\right] = \mathcal{P}. \qquad (9)$$
4.3 Multi-doubly robust estimator
The Multi-DR estimator augments the Multi-IPW estimator and adds double robustness by including an imputation model that estimates the prediction error $\hat{e}_{u,i}$, where $\delta(\cdot, \cdot)$ denotes the loss function of the prediction model. Bear in mind that Multi-DR also inherits the merits of the multi-task learning module described in Section 4.1. Multi-DR optimizes the following loss,

$$\mathcal{L}_{\text{Multi-DR}}(\Phi, \theta_{\mathrm{CTR}}, \theta_{\mathrm{CVR}}, \theta_{\mathrm{Imp}}) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left( \hat{e}_{u,i} + \frac{o_{u,i} \left( e_{u,i} - \hat{e}_{u,i} \right)}{\hat{p}_{u,i}} \right), \qquad (10)$$

where $\Phi$ represents the shared embedding parameters among the CTR task, the CVR task, and the imputation task; $\theta_{\mathrm{CTR}}$, $\theta_{\mathrm{CVR}}$, and $\theta_{\mathrm{Imp}}$ are the neural network parameters of the CTR, CVR, and imputation tasks, respectively. $\hat{p}_{u,i}$ is the predicted CTR score given by the CTR task. The estimated imputation error $\hat{e}_{u,i}$, parameterized by $\Phi$ and $\theta_{\mathrm{Imp}}$, is given by the imputation task.
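A plain-NumPy sketch of how this loss might be assembled (illustrative names and toy inputs, not the authors' implementation): the CTR head supplies the propensities, the CVR head supplies the errors, and the imputation head is trained to match the observed errors.

```python
import numpy as np

def multi_dr_loss(o, y, ctr_hat, cvr_hat, e_hat, eps=1e-6):
    """o: clicks; y: conversions; e_hat: imputed CVR errors; all over exposure space."""
    ctr_hat = np.clip(ctr_hat, eps, 1 - eps)
    cvr_hat = np.clip(cvr_hat, eps, 1 - eps)
    # Auxiliary CTR task.
    ctr_loss = -(o * np.log(ctr_hat) + (1 - o) * np.log(1 - ctr_hat)).mean()
    # CVR cross-entropy error on each entry.
    cvr_ce = -(y * np.log(cvr_hat) + (1 - y) * np.log(1 - cvr_hat))
    # DR term: imputed error everywhere, plus an inverse-propensity-weighted
    # residual (observed minus imputed error) on the clicked entries.
    dr_loss = (e_hat + o * (cvr_ce - e_hat) / ctr_hat).mean()
    # Imputation task: fit e_hat to the observed CVR errors.
    imp_loss = (o * (cvr_ce - e_hat) ** 2 / ctr_hat).mean()
    return ctr_loss + dr_loss + imp_loss

o = np.array([1., 1., 0., 0.])
y = np.array([1., 0., 0., 0.])
loss = multi_dr_loss(o, y,
                     ctr_hat=np.array([0.6, 0.5, 0.4, 0.3]),
                     cvr_hat=np.array([0.7, 0.2, 0.1, 0.1]),
                     e_hat=np.array([0.4, 0.3, 0.2, 0.2]))
print(loss)
```

The exact weighting of the imputation term (and any regularization on it) is a design choice; the sketch only illustrates how the three heads interact.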
Theorem 2
Given the true propensities $p_{u,i}$ and the true conversion labels $y_{u,i}$, the Multi-DR CVR estimator gives unbiased CVR prediction when either the estimated CTR scores are accurate (i.e., $\hat{p}_{u,i} = p_{u,i}$) or the estimated prediction errors are accurate (i.e., $\hat{e}_{u,i} = e_{u,i}$):

$$\mathbb{E}_O\!\left[\mathcal{E}_{\text{Multi-DR}}\right] = \mathcal{P}. \qquad (11)$$
5 Experiments
In this section, we evaluate the performance of the proposed models on a public dataset and a large-scale production dataset collected from Mobile Taobao, the leading e-commerce platform in China. The experiments are intended to answer the following questions:

Q1: Do our proposed approaches outperform other state-of-the-art CVR estimation methods?

Q2: Are our proposed models more efficient in an industrial setting than other baseline models?

Q3: How is the performance of our proposed models affected by hyperparameters?
5.1 Datasets
We conduct the experiments on the following two datasets:
AliCCP
Alibaba Click and Conversion Prediction (AliCCP) dataset is collected from realworld traffic logs of the recommender system in Taobao platform [ma_entire_2018]. The dataset includes 84M data samples that contain 3.4M clicks, and 18K conversions. AliCCP is divided approximately 50/50 into training set and testing set. This public dataset contains three categories of features: 1) user features, 2) item features, 3) combination features.
Production sets
This industrial production dataset is collected from the Mobile Taobao e-commerce platform. We collected 3 weeks of transaction data, from 2019-08-17 to 2019-09-08. There are 11.5 billion data samples, 0.6 billion clicks, and 8.3 million conversions. Our production dataset includes 109 features, which are primarily categorized into: 1) user features, 2) item features, and 3) combination features. We further divide this 3-week transaction dataset into 4 subsets:

Set A uses the data of 2019-08-17 as the training set, and the data of the next day as the testing set. Set A is approximately 5% of the entire dataset.

Set B uses the data from 2019-08-17 to 2019-08-20 as the training set, and the data of 2019-08-21 as the testing set. Set B is approximately 20% of the entire dataset.

Set C uses the data from 2019-08-17 to 2019-08-27 as the training set, and the data of 2019-08-28 as the testing set. Set C is approximately 50% of the entire dataset.

Set D contains all the data from 2019-08-17 to 2019-09-08. The training set includes all data in the 3 weeks except the last day (2019-09-08); the testing set includes the data of 2019-09-08.
| Dataset | # Exposure | # Click | # Conversion | # User | # Item |
|---|---|---|---|---|---|
| AliCCP | 84M | 3.4M | 18K | 0.4M | 4.3M |
| Set A | 1.1B | 54.5M | 0.6M | - | 22.5M |
| Set B | 2.7B | 0.2B | 1.9M | - | 39.1M |
| Set C | 6.0B | 0.4B | 4.3M | - | 62.6M |
| Set D | 11.5B | 0.6B | 8.3M | - | 81.5M |
5.2 Baseline models
We compare the Multi-IPW model and the Multi-DR model with the following baselines. Note that some baselines are causal estimators that we modify for unbiased CVR prediction, and the others are existing non-causal estimators designed for CVR prediction.
5.2.1 Non-causal estimators

Base is a naive post-click CVR model, implemented as a multi-layer perceptron (see the CVR task in Figure 4). Note that this is essentially an independent MLP model which takes the feature embeddings as input and predicts the CVR. The base model is trained in the click space.
Oversampling [weiss2004mining] deals with the class imbalance issue by duplicating the minority data samples (conversion=1) in training set with an oversampling rate k. In our experiment, we set k = 5. The oversampling model is trained in the click space.

ESMM [ma_entire_2018] utilizes multi-task learning and reduces CVR estimation to two auxiliary tasks, i.e., the CTR task and the CTCVR task. ESMM is trained in the entire exposure space and is deemed the state-of-the-art CVR estimation model in real practice.

Naive Imputation takes all the unclicked data as negative samples. Hence, it is trained in the entire exposure space.
5.2.2 Causal estimators

Naive IPW [schnabel_recommendations_2016] is the naive IPW estimator described in Section 3.4. Note that it is not specifically designed for the CVR estimation task, which has its own intrinsic issues; for example, it cannot deal with the data sparsity issue that inherently exists in the CVR task.

Joint Learning DR [wang_doubly_2019] is devised to learn from ratings that are missing not at random. In this experiment, we tailor Joint Learning DR for CVR estimation. Like Naive IPW, Joint Learning DR handles the data sparsity issue poorly.

Heuristic DR is designed as a baseline for Multi-DR. It assumes that an unclicked item is a negative sample with a probability controlled by a smoothing rate, which denotes the probability of having a positive label. In the experiments, we tune the smoothing rate and report the best performance.
5.3 Metrics
In the CVR prediction task, the area under the receiver operating characteristic curve (AUC) is a widely used metric [fawcett2006introduction]. One interpretation of AUC in the context of a ranking system is that it denotes the probability of ranking a random positive sample higher than a random negative sample. Meanwhile, we also adopt Group AUC (GAUC) in our assessments [zhou2018deep]. GAUC extends AUC by calculating the weighted average of AUC grouped by page views or users,

$$\mathrm{GAUC} = \frac{\sum_{g} w_{g} \cdot \mathrm{AUC}_{g}}{\sum_{g} w_{g}}, \qquad (12)$$

where the weight $w_{g}$ is the number of exposures in group $g$. GAUC is commonly recognized as a more indicative metric in real practice [zhou2018deep]. In this paper, we evaluate the proposed models and baselines with both AUC and GAUC on the production sets. On the public dataset, models are only assessed with AUC, as the dataset is missing the information needed to compute GAUC.
In addition to the CVR estimation task, we also evaluate our predictive models on the CTCVR estimation task, which predicts the conversion probability of an item given that it is exposed to a customer. In our experiments, CTCVR is computed as $\hat{p}^{\mathrm{CTCVR}} = \hat{p}^{\mathrm{CTR}} \times \hat{p}^{\mathrm{CVR}}$. To sum up, we have three different metrics in total: 1) CVR AUC, 2) CTCVR AUC, 3) CTCVR GAUC.
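A minimal illustration of the GAUC computation in Equation (12), with exposure counts as group weights (toy data; groups stand in for users or page views):

```python
import numpy as np

def auc(labels, scores):
    """Probability that a random positive is ranked above a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None                          # AUC undefined for one-class groups
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def gauc(labels, scores, groups):
    total, weight = 0.0, 0
    for g in np.unique(groups):
        m = groups == g
        a = auc(labels[m], scores[m])
        if a is not None:                    # skip groups with no pos/neg pairs
            total += m.sum() * a             # weight by number of exposures
            weight += m.sum()
    return total / weight

labels = np.array([1, 0, 0, 1, 0, 1])
scores = np.array([0.9, 0.3, 0.2, 0.8, 0.7, 0.4])
groups = np.array([0, 0, 0, 1, 1, 1])
print(gauc(labels, scores, groups))          # group AUCs 1.0 and 0.5 -> 0.75
```

Groups in which AUC is undefined are skipped here; production implementations vary in how they handle such groups.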
5.4 Experiments setup
5.4.1 AliCCP experiment
The experiment setup on AliCCP mostly follows the prior work [ma_entire_2018]. We set the dimension of all embedding vectors to 18. The architectures of all the multi-layer perceptrons (MLPs) in the multi-task learning module are identical. The optimizer is Adam.
5.4.2 Production set experiment
In the production set experiment, we vary the dimensions of the feature embedding vectors according to each feature's real size in order to minimize memory usage. To ensure a fair comparison, all the models in this experiment share the same embedding dimensions and MLP architecture, and are optimized with Adam. We also add regularization to the imputation model in Multi-DR.
5.5 Experiment results (Q1)
In this section, we demonstrate that our proposed approaches clearly outperform the baseline CVR estimation methods. We conduct experiments on the public dataset (AliCCP) with 84 million data samples and the production dataset with 11.5 billion data samples. The results are summarized in Table 2 and Table 3. Multi-IPW and Multi-DR are clear winners over the other baselines across all experiments. Meanwhile, we have the following observations:

Multi-IPW and Multi-DR are consistently better than ESMM on both the public dataset and the production dataset. We believe the performance boost comes from the fact that our proposed models account for the data generation process and adjust for the cause of missingness, whereas ESMM is blind to the MNAR mechanism and its prediction is biased.

On the production dataset, Multi-IPW and Multi-DR consistently outperform Joint Learning DR [wang_doubly_2019]. We reason that the performance improvement results from the multi-task learning module. Recall that in our proposed estimators, feature embeddings are shared among the CVR task, CTR task, and imputation task during the training phase. This mechanism counters the inherent data sparsity issue in the CVR estimation task. The Joint Learning DR, however, is incapable of solving this issue; hence, we see the degradation in its prediction performance.
Table 2: Results of the comparison study on the production datasets. The best scores in each column were boldfaced in the original, and rows containing the proposed models were highlighted in grey. The table has two sections: AUC scores (CVR and CTCVR per set) and GAUC scores (CTCVR GAUC per set).

AUC scores:

| Model | Set A (1.1B) CVR | CTCVR | Set B (2.7B) CVR | CTCVR | Set C (6.0B) CVR | CTCVR | Set D (11.5B) CVR | CTCVR |
|---|---|---|---|---|---|---|---|---|
| Base | 78.24 | 73.12 | 78.67 | 73.86 | 79.62 | 74.70 | 81.66 | 76.28 |
| Oversampling [weiss2004mining] | 78.63 | 73.53 | 78.72 | 74.09 | 79.69 | 74.82 | 81.77 | 76.30 |
| ESMM [ma_entire_2018] | 79.29 | 73.86 | 79.74 | 74.33 | 80.11 | 74.97 | 82.17 | 76.55 |
| Naive Imputation | 78.12 | 73.21 | 78.44 | 73.50 | 79.32 | 73.81 | 81.56 | 76.39 |
| Naive IPW [schnabel_recommendations_2016] | 79.23 | 73.82 | 79.73 | 74.34 | 80.14 | 74.92 | 82.13 | 76.45 |
| Heuristic DR | 78.45 | 73.45 | 78.84 | 73.99 | 79.52 | 74.18 | 81.74 | 76.40 |
| Joint Learning DR [wang_doubly_2019] | 79.09 | 73.67 | 79.53 | 74.51 | 80.01 | 74.90 | 82.09 | 76.61 |
| Multi-IPW | 79.51 | 73.99 | 79.85 | 74.81 | 80.21 | 75.01 | 82.57 | 76.89 |
| Multi-DR | 79.72 | 74.45 | 79.80 | 74.91 | 80.50 | 75.39 | 82.72 | 77.23 |

GAUC scores:

| Model | Set A | Set B | Set C | Set D |
|---|---|---|---|---|
| Base | 59.69 | 60.16 | 60.58 | 61.27 |
| Oversampling [weiss2004mining] | 60.17 | 60.28 | 60.59 | 61.30 |
| ESMM [ma_entire_2018] | 60.53 | 60.90 | 61.13 | 61.76 |
| Naive Imputation | 60.14 | 60.39 | 60.56 | 61.39 |
| Naive IPW [schnabel_recommendations_2016] | 60.51 | 60.95 | 61.09 | 61.77 |
| Heuristic DR | 60.01 | 60.30 | 60.65 | 61.35 |
| Joint Learning DR [wang_doubly_2019] | 60.43 | 60.83 | 60.97 | 61.67 |
| Multi-IPW | 60.70 | 61.09 | 61.25 | 61.98 |
| Multi-DR | 60.90 | 60.99 | 61.52 | 62.28 |

Table 3: Results of the comparison study on the public dataset (AliCCP). Experiments are repeated 10 times; the mean ± 1 std of the AUC scores are reported. The best scores in each column were boldfaced in the original, and rows containing the proposed models were highlighted in grey.

| Model | CVR AUC | CTCVR AUC |
|---|---|---|
| Base | 66.00 ± 0.37 | 62.07 ± 0.45 |
| Oversampling [weiss2004mining] | 67.18 ± 0.32 | 63.05 ± 0.48 |
| ESMM-NS [ma_entire_2018] | 68.25 ± 0.44 | 64.44 ± 0.62 |
| ESMM [ma_entire_2018] | 68.56 ± 0.37 | 65.32 ± 0.49 |
| Multi-IPW | 69.21 ± 0.42 | 65.30 ± 0.50 |
| Multi-DR | 69.29 ± 0.31 | 65.43 ± 0.34 |
We observe that the Multi-IPW estimator is superior to the Naive IPW estimator in all experiments. Note that there is a trade-off between an independent, potentially better-trained propensity model and the merit of multi-task learning, i.e., information sharing between modules. We observe that by sacrificing the accuracy of the propensity model to some extent, we can achieve better performance from a co-trained prediction model.

We notice that Multi-DR performs better than Multi-IPW in most cases. Recall that Multi-DR augments Multi-IPW by introducing an imputation model. Provided that, for every sample, the imputed error deviates from the true error by less than the true error itself, the tail bound of Multi-DR is proven to be lower than that of Multi-IPW for any learned propensity score [wang_doubly_2019]. Therefore, Multi-DR is expected to perform better than Multi-IPW when the imputation model is well trained. Note that Multi-DR is more complex in design than Multi-IPW: the performance boost comes at the price of added implementation complexity. In practice, we also need to account for the difficulty of model development. Therefore, Multi-IPW is designed for scenarios where quick and cheap model deployment is demanded, whereas Multi-DR is preferable when there is enough time to search for a good imputation model and sampling rate.
The experimental results demonstrate that Multi-IPW and Multi-DR counter the selection bias and data sparsity issues in CVR estimation in a principled and highly effective way. In the next subsection, we discuss other strengths of the proposed methods.
5.6 Computational efficiency (Q2)
In this part, we study the computational efficiency of the proposed CVR estimators against the baselines under an industrial setting. We summarize the training time and parameter space size of each model in Figure 5; the results show that the proposed methods are cost-effective. This comparison study is conducted on a distributed cluster with the configuration summarized in Table 4.
We observe that Joint Learning DR and Naive IPW demand over 30 hours of training for one epoch on the whole production dataset. Recall that Naive IPW needs to train an independent propensity estimator before fitting the CVR predictor, which roughly doubles its training time relative to the multi-task learning CVR estimators (e.g., ESMM, Multi-DR, and Multi-IPW), since the multi-task methods train the CTR task, CVR task, and imputation task at the same time. As for Joint Learning DR, it trains the imputation model and the prediction model alternately; hence, its training time is at least double that of the other estimators.
In addition, we notice that Multi-DR and Multi-IPW are also storage-efficient compared with Joint Learning DR and Naive IPW. Since the embedding parameters are shared among the CTR task, CVR task, and imputation task in the proposed models, we only need to save one copy of those trainable parameters on the parameter servers. Note that Multi-DR has more hidden-layer parameters than ESMM and Naive IPW, due to the additional imputation model. However, the memory usage of the hidden-layer parameters is generally negligible compared with the huge storage consumption of the embedding layers. In a nutshell, the proposed models achieve the best performance at the lowest computational cost.
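A back-of-the-envelope calculation makes the storage argument concrete. The vocabulary size, embedding dimension, and tower widths below are hypothetical placeholders (the paper does not report these exact figures); the point is only that the shared embedding table dominates the parameter count by orders of magnitude.

```python
# Illustrative parameter accounting: a sparse-ID embedding table
# dwarfs the dense hidden layers, so an extra imputation tower
# adds little to overall storage.
VOCAB_SIZE = 10_000_000      # hypothetical number of sparse feature ids
EMB_DIM = 32
HIDDEN = [512, 256, 128, 1]  # hypothetical tower layer widths

embedding_params = VOCAB_SIZE * EMB_DIM

def tower_params(input_dim, widths):
    # Count weights + biases of a fully connected tower.
    params, prev = 0, input_dim
    for w in widths:
        params += prev * w + w
        prev = w
    return params

one_tower = tower_params(EMB_DIM, HIDDEN)
three_towers = 3 * one_tower  # CTR + CVR + imputation heads

print(embedding_params, three_towers)
# The single shared embedding copy dominates total storage, so the
# extra imputation head barely changes the memory footprint.
```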
Cluster configuration   Parameter Server   Worker
# instances             4                  100
# CPU                   28 cores           440 cores
# GPU¹                  -                  25 cards
MEMORY (GB)             40                 1000

¹ GPU specs: Tesla P100-PCIE-16G.

Table 4: Cluster configuration.
5.7 Hyper-parameters in model implementation
The IPW bound is a hyper-parameter introduced in our model implementation to handle the high variance of propensities: it clamps a propensity weight whenever the value exceeds the predefined threshold. A plausible IPW bound value is typically confined by , and the IPW bound percentage can be calculated as .

In Multi-DR, the imputation model introduces unclicked items into the training set. Empirically, most unclicked items would not be purchased by customers even if they were clicked, so including all of them in the training set would skew the data distribution and worsen the class imbalance issue. Therefore, instead of adding all the unclicked samples, we undersample them with a sampling rate . For example, if the number of clicked samples () is 100 and the batch size is 1000, then means that after undersampling the number of samples used to train Multi-DR is . Note that without undersampling, Multi-DR takes all samples in the batch as training samples.

5.8 Empirical study on hyper-parameter sensitivity (Q3)
In this section, we investigate how Multi-IPW and Multi-DR are affected by two important hyper-parameters in our model implementation: the IPW bound and the sampling rate . We evaluate the performance of Multi-IPW with a varying IPW bound . We observe in Figure 6 that, when the IPW bound , prediction performance initially improves as the IPW bound increases, but CVR AUC clearly drops once the threshold exceeds . We reason that a larger IPW bound permits undesirably high variability of the propensity scores, which can lead to suboptimal prediction performance.
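The clamping operation itself is simple; the sketch below shows one plausible implementation of an IPW bound on the inverse-propensity weights (the function name and toy values are illustrative, not from the paper's code).

```python
import numpy as np

def clamp_inverse_propensity(propensities, ipw_bound):
    # Cap the inverse-propensity weights at a fixed threshold so that
    # a few tiny propensities cannot blow up the estimator's variance.
    weights = 1.0 / propensities
    return np.minimum(weights, ipw_bound)

p = np.array([0.5, 0.1, 0.001])
print(clamp_inverse_propensity(p, ipw_bound=100.0))
# The 1/0.001 = 1000 weight is clamped down to the bound of 100,
# while the moderate weights (2 and 10) pass through unchanged.
```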
We evaluate the performance of Multi-DR with a varying sampling rate . We observe that produces the best prediction, and that model performance starts decreasing when . We argue that, as the sampling rate increases, more unclicked samples are included in the training set, which inevitably worsens the class imbalance issue and typically causes predictive models to generalize poorly. In contrast, introducing a small number of unclicked samples from the imputation model can boost CVR prediction (see Figure (d) when ).
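A minimal sketch of the undersampling step, under the assumption that unclicked samples are kept independently with probability equal to the sampling rate (the paper does not specify the exact sampling scheme, and the function name and batch sizes are illustrative):

```python
import numpy as np

def undersample_unclicked(clicks, sampling_rate, rng):
    # Keep every clicked sample; keep each unclicked sample with
    # probability `sampling_rate` for the imputation-model loss.
    return (clicks == 1) | (rng.random(clicks.shape) < sampling_rate)

rng = np.random.default_rng(0)
clicks = np.array([1] * 100 + [0] * 900)  # 100 clicked in a batch of 1000
keep = undersample_unclicked(clicks, sampling_rate=0.1, rng=rng)
# All 100 clicked samples survive; roughly 10% of the 900 unclicked
# samples are kept, limiting how much the class balance is skewed.
```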
6 Conclusion and future work
In this paper, we proposed the Multi-IPW and Multi-DR CVR estimators for industrial recommender systems. Both estimators aim to counter two inherent issues that exist in real practice: 1) selection bias and 2) data sparsity. Extensive experiments with billions of data samples demonstrate that our methods outperform state-of-the-art CVR predictive models and handle the CVR estimation task in a principled, highly effective, and efficient way. Our methods exploit the sequential events in e-commerce, "exposure → click → conversion", by adopting a multi-task learning approach to boost model performance and reduce computational cost. Although our methods are devised for CVR estimation, the underlying idea can be generalized to other prediction tasks in recommender systems. For example, in the CTR estimation task, the recommender system typically chooses which items are exposed to customers, so the dataset for CTR estimation may also suffer from data missing not at random. The sequential pattern in the CTR prediction task can be summarized as "item pool → exposure → click". Intuitively, our methods can be modified for this task. Future work may explore applying our models to various prediction tasks in industrial recommender systems.