1. Introduction
E-commerce recommender systems aim not only to help users explore items of interest, but also to increase revenue for the platform. Therefore, estimating the post-click conversion rate (CVR), i.e., the probability of an item being purchased after it is clicked, is a crucial task for building such systems in practice. Moreover, post-click conversion feedback has been widely recognized as a strong signal for learning recommender systems, as it explicitly expresses user preference and directly contributes to the gross merchandise volume (GMV) of the platform (gmv; drcvr). However, it is very challenging to model such signals, which are extremely sparse in real-world applications. In particular, post-click conversion feedback can only be observed in clicked events, which make up a tiny fraction of all possible user behaviors, while the conversion feedback for unclicked events is missing. As such, a fundamental problem of CVR estimation is to study the missing mechanism in the post-click conversion feedback.
For simplicity, conventional CVR models usually assume that the missing conversion feedback is missing-at-random (MAR). Such an assumption can barely hold under selection bias, and recent studies (rat; ipsimplicitlearn; drali) have shown that a recommendation model built on the MAR assumption often leads to suboptimal results. On real-world e-commerce platforms, as users are free to click the items that they are likely to purchase (i.e., user self-selection), the observed clicked events are not representative samples of all events, which makes the missing conversions missing-not-at-random (MNAR). In other words, the fundamental reason behind the selection bias is that users' propensities vary from item to item. Here, the propensity is defined as the probability of an item being clicked by a user, i.e., the click-through rate (CTR). Hence, in this paper, we adopt the MNAR assumption when estimating the post-click conversion rate, and focus on addressing the selection bias problem.
In recent years, three unbiased estimators from counterfactual learning have been applied to debias CVR estimation.
(1) The error imputation based (EIB) estimator (pmfdebias; eib) computes an imputed error, i.e., an estimated value of the prediction error, for each unclicked event, and then uses it to estimate the true prediction error over all events. However, this estimator often has a large bias due to inaccurate error imputation, which easily misleads the CVR estimation. (2) The inverse propensity score (IPS) estimator (rat; gmcm; esmm) inversely weights the prediction error of each clicked event with its estimated CTR to correct the mismatch between the distributions of clicked and unclicked events. Although this estimator is unbiased given the ground-truth CTRs, it typically suffers from a high variance problem, which can lead to suboptimal results. (3) The doubly robust (DR) estimator (drjl; drali; dr) combines the EIB estimator and the IPS estimator to ensure both low variance and low bias. Its unbiasedness is guaranteed if either the imputed errors or the CTRs are accurate. This property is called double robustness.
Among the aforementioned estimators, the DR estimator has achieved initial success in debiasing recommender systems (drjl; drali; drcvr). However, two inherent challenges remain. First, despite the double robustness, the DR estimator may increase the variance of the IPS estimator under inaccurate error imputation, which complicates the learning process and leads to suboptimal results. Hence, further variance reduction for DR approaches deserves investigation. Second, although the DR estimator is more robust than the EIB and IPS estimators, it still requires relatively accurate CTR estimation and error imputation. Of these two tasks, the former has been extensively investigated by many works (ctrdeepfm; ctrdin), whereas the latter has rarely been investigated. To estimate the imputed errors, previous DR based approaches typically introduce an extra imputation model that is agnostic of the prediction model, such as linear regression (dr), matrix factorization (drjl), a multi-layer perceptron (MLP) (drali), etc. Here, the imputed errors, which serve as the gradient directions of the prediction model, should change dynamically during its learning process. However, simply using model-agnostic methods is not sufficient to approximate such a model-correlated target. Thus, a better solution for modeling the error imputation is still needed.
To address the above challenges, we propose an enhanced doubly robust learning approach for debiasing post-click conversion rate estimation. To tackle the first challenge, we propose to reduce the variance of the DR estimator by redesigning the goal of the imputation learning (i.e., the learning process of the imputation model) as the minimization of its variance (mrdr; mrdrinit). Specifically, we derive the bias and variance of the DR estimator, based on which we propose the more robust doubly robust (MRDR) estimator as a variant of the DR estimator that achieves lower variance while retaining the double robustness. Moreover, inspired by Double DQN (doubleDQN)
in reinforcement learning, we propose a novel double learning approach for the MRDR estimator to tackle the second limitation. In particular, we adopt two CVR models with the same structure but different parameters. The first serves as the prediction model, learning from both the imputed errors and the true prediction errors for the final CVR estimation. The second serves as the imputation model, generating a pseudo label for each event from its predicted CVR. During the learning of the prediction model, the imputed error can be directly computed from the pseudo label and the predicted CVR. As such, we convert the error imputation into a general CVR estimation task, and the imputed errors can be dynamically estimated in a model-correlated way. We alternate the learning processes of the two models so that they mutually regularize each other. In addition, we periodically update the parameters of the imputation model with the parameters of the prediction model, which is empirically beneficial for alleviating the high variance problem of the imputation learning. Extensive experiments are conducted on both semi-synthetic and real-world datasets to verify the effectiveness of both the proposed MRDR estimator and the double learning approach.
The main contributions of this work are summarized as follows.

We conduct a theoretical analysis of the bias and variance of the DR estimator, based on which we propose the more robust doubly robust (MRDR) estimator. It achieves further variance reduction while retaining the double robustness.

To dynamically utilize the information of the prediction model for error imputation, we propose a novel double learning approach for the MRDR estimator, which is also empirically beneficial for addressing the high variance problem of the imputation learning.

Experimental results on a semi-synthetic dataset empirically verify the effectiveness of the proposed MRDR estimator. Furthermore, we conduct extensive experiments on two real-world datasets. The results show that the proposed enhanced doubly robust learning approach, MRDR-DL, outperforms the state-of-the-art methods.
2. Preliminaries
In this section, we detail the problem formulation and introduce existing unbiased estimators in the post-click conversion setting.
2.1. Problem Formulation
Let $\mathcal{U}$ be the set of users, $\mathcal{I}$ be the set of items, and $\mathcal{D} = \mathcal{U} \times \mathcal{I}$ be the set of all user-item pairs. We denote $\mathbf{R} \in \{0,1\}^{|\mathcal{U}| \times |\mathcal{I}|}$ as the conversion label matrix, where each entry $r_{u,i}$ indicates whether a conversion occurs after user $u$ clicks item $i$. We use $\hat{\mathbf{R}}$ to represent the predicted conversion rate matrix, where $\hat{r}_{u,i}$ is the conversion rate predicted by a model. If we had a fully observed conversion label matrix $\mathbf{R}$, the ideal loss function for minimization could be formulated as
(1) $\mathcal{L}_{\mathrm{ideal}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} e_{u,i}$
where $e_{u,i}$ is the prediction error. We usually adopt the cross entropy, $e_{u,i} = -r_{u,i} \log \hat{r}_{u,i} - (1 - r_{u,i}) \log (1 - \hat{r}_{u,i})$, as the optimization goal for binary classification. Let $\mathbf{O} \in \{0,1\}^{|\mathcal{U}| \times |\mathcal{I}|}$ be the click indicator matrix with each entry $o_{u,i} = 1$ if user $u$ clicks item $i$, and $o_{u,i} = 0$ otherwise. Since only post-click conversions for clicked events can be observed, the naive estimator estimates the ideal loss function by averaging the prediction error over clicked events as
(2) $\mathcal{E}_{\mathrm{naive}} = \frac{1}{|\mathcal{O}|} \sum_{(u,i) \in \mathcal{O}} e_{u,i}$
where $\mathcal{O} = \{(u,i) \mid (u,i) \in \mathcal{D},\, o_{u,i} = 1\}$ denotes the set of clicked events. The naive estimator is intuitive and widely adopted by many existing methods. However, due to the selection bias, the conversions for unclicked events are MNAR, which leads to a biased estimation, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{\mathrm{naive}}] \neq \mathcal{L}_{\mathrm{ideal}}$. Previous works (rat; drjl; ipsimplicitlearn) have proved that the learning process based on a biased estimator often yields a suboptimal prediction model. Hence, it is essential to develop an unbiased estimator to address the MNAR problem. In the following, we introduce three existing unbiased estimators.
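As a concrete illustration, the naive estimator can be sketched in a few lines of NumPy (a minimal sketch; all function and variable names are ours, not from the paper):

```python
import numpy as np

def cross_entropy(r, r_hat, eps=1e-8):
    # Per-event prediction error e_{u,i}: binary cross entropy between the
    # true conversion label r and the predicted CVR r_hat.
    return -(r * np.log(r_hat + eps) + (1 - r) * np.log(1 - r_hat + eps))

def naive_estimator(R, R_hat, O):
    # Average the prediction error over clicked events only (o_{u,i} = 1),
    # ignoring all unclicked events -- biased under MNAR.
    e = cross_entropy(R, R_hat)
    return e[O == 1].mean()
```

When every event is clicked, this coincides with the ideal loss; selection bias arises precisely because it does not.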
2.2. Error Imputation Based Estimator
The error imputation based (EIB) estimator introduces an imputation model to compute the imputed error $\hat{e}_{u,i}$, i.e., the estimated value of the prediction error (eib; pmfdebias). Leveraging the imputed errors for unclicked events and the prediction errors for clicked events, we estimate the ideal loss function with the EIB estimator as
(3) $\mathcal{E}_{\mathrm{EIB}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ o_{u,i} e_{u,i} + (1 - o_{u,i}) \hat{e}_{u,i} \right]$
When the imputed error is accurate for every unclicked event, the EIB estimator is unbiased, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{\mathrm{EIB}}] = \mathcal{L}_{\mathrm{ideal}}$. However, the EIB estimator can hardly achieve accurate error imputation, and thus often has a large bias in practice, which easily misleads the learning of the prediction model.
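The EIB estimator can be sketched as follows (a minimal NumPy sketch; `e` and `e_hat` hold per-event prediction and imputed errors, and all names are ours):

```python
import numpy as np

def eib_estimator(e, e_hat, O):
    # Use the true prediction error e on clicked events and the imputed
    # error e_hat on unclicked events, averaged over all events in D.
    return (O * e + (1 - O) * e_hat).mean()
```

The estimate is exact only to the extent that `e_hat` matches `e` on unclicked events, which is the source of EIB's bias.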
2.3. Inverse Propensity Score Estimator
The inverse propensity score (IPS) estimator (rat; ipsimplicitlearn; gmcm) weights each clicked event with the inverse propensity $1 / \hat{p}_{u,i}$, where the propensity $p_{u,i}$ refers to the probability of item $i$ being clicked by user $u$, i.e., the click-through rate (CTR) in the post-click conversion setting. By introducing an auxiliary CTR task to estimate the propensity as $\hat{p}_{u,i}$, the IPS estimator can be formulated as
(4) $\mathcal{E}_{\mathrm{IPS}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} e_{u,i}}{\hat{p}_{u,i}}$
The IPS estimator derives an unbiased estimate of the ideal loss function, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{\mathrm{IPS}}] = \mathcal{L}_{\mathrm{ideal}}$, when the estimated propensity is accurate for every clicked event. However, as the clicked events account for only a small part of $\mathcal{D}$, the estimated CTR $\hat{p}_{u,i}$ is typically small. Hence, the IPS estimator suffers from an especially severe high variance problem.
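A minimal NumPy sketch of the IPS estimator (names are ours; `e`, `O`, and `p_hat` hold per-event errors, click indicators, and estimated propensities):

```python
import numpy as np

def ips_estimator(e, O, p_hat):
    # Weight each clicked event's error by the inverse of its estimated
    # propensity (CTR), averaging over all events in D.
    return (O * e / p_hat).mean()
```

Unbiasedness can be seen by substituting the expectation of the click indicator: since E[o] = p, feeding the true propensities in place of `O` recovers the ideal loss when `p_hat` equals `p`.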
2.4. Doubly Robust Estimator
To address the large bias of the EIB estimator and the high variance of the IPS estimator, the doubly robust (DR) estimator is adopted by many previous works (dr; drjl; drali). It combines the EIB estimator and the IPS estimator in a doubly robust way. In particular, this estimator uses the imputed errors to estimate the prediction errors for all events, and corrects the error deviation on clicked events, where the inverse propensity is applied to the error deviation to eliminate the MNAR effect. The loss function of the DR estimator is defined as
(5) $\mathcal{E}_{\mathrm{DR}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ \hat{e}_{u,i} + \frac{o_{u,i} (e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} \right]$
The DR estimator is unbiased, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{\mathrm{DR}}] = \mathcal{L}_{\mathrm{ideal}}$, if either the imputed error of every event or the propensity of every clicked event is accurate. This property is known as double robustness. To compute the imputed errors, previous works typically introduce a separate imputation model. Since the imputation learning is essentially a regression problem, DR uses the squared loss,
(6) $\mathcal{L}_{e} = \sum_{(u,i) \in \mathcal{O}} \frac{(\hat{e}_{u,i} - e_{u,i})^{2}}{\hat{p}_{u,i}}$
to train the imputation model. The inverse propensity score serves as a weight to account for the MNAR effect, which also introduces the high variance problem into the imputation learning.
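The DR estimator and its imputation loss can be sketched together in NumPy (a hypothetical sketch; `e`, `e_hat`, `O`, and `p_hat` hold per-event true errors, imputed errors, click indicators, and estimated propensities):

```python
import numpy as np

def dr_estimator(e, e_hat, O, p_hat):
    # Impute an error for every event, then apply a propensity-weighted
    # correction of the deviation (e - e_hat) on clicked events.
    return (e_hat + O * (e - e_hat) / p_hat).mean()

def dr_imputation_loss(e, e_hat, O, p_hat):
    # Squared error deviation on clicked events, inversely weighted by the
    # estimated propensity; used to train the imputation model.
    return (O * (e_hat - e) ** 2 / p_hat).sum()
```

The double robustness is easy to verify mechanically: with accurate imputation the correction term vanishes for any propensities, and with accurate propensities the correction is unbiased in expectation for any imputation.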
3. Enhanced Doubly Robust Learning Approach
In this section, we elaborate on the proposed enhanced doubly robust learning approach. We first analyze the bias and variance of the doubly robust estimator, based on which we propose the more robust doubly robust (MRDR) estimator for further variance reduction. Then, we detail the proposed novel double learning approach for the MRDR estimator.
3.1. Bias and Variance Analysis of DR Estimator
We first formulate the bias of the DR estimator to prove its double robustness.
Theorem 3.1 ().
Let $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$ denote the additive error deviation, and $\Delta_{u,i} = \frac{p_{u,i} - \hat{p}_{u,i}}{\hat{p}_{u,i}}$ the multiplicative propensity deviation. Then, the bias of the DR estimator is
(7) $\mathrm{Bias}[\mathcal{E}_{\mathrm{DR}}] = \frac{1}{|\mathcal{D}|} \left| \sum_{(u,i) \in \mathcal{D}} \Delta_{u,i}\, \delta_{u,i} \right|$
Proof.
See Theorem 3.2 in (drali) for the proof. ∎
As shown in Theorem 3.1, the DR estimator is close to the ideal loss function, i.e., $\mathrm{Bias}[\mathcal{E}_{\mathrm{DR}}] = 0$, if either $\delta_{u,i} = 0$ or $\Delta_{u,i} = 0$ holds for every event, whereas the EIB estimator requires $\delta_{u,i} = 0$ and the IPS estimator requires $\Delta_{u,i} = 0$. This property is called double robustness. Next, we derive the variance of the DR estimator.
Theorem 3.2 ().
The variance of the DR estimator is
(8) $\mathbb{V}_{\mathbf{O}}[\mathcal{E}_{\mathrm{DR}}] = \frac{1}{|\mathcal{D}|^{2}} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i} (1 - p_{u,i})\, \delta_{u,i}^{2}}{\hat{p}_{u,i}^{2}}$
Proof.
For a single term of the DR estimator, its variance with respect to the click indicator $o_{u,i} \sim \mathrm{Bern}(p_{u,i})$ is
(9) $\mathbb{V}_{o_{u,i}}\!\left[ \hat{e}_{u,i} + \frac{o_{u,i} (e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} \right] = \frac{p_{u,i} (1 - p_{u,i})\, \delta_{u,i}^{2}}{\hat{p}_{u,i}^{2}}$
Then, summing over all terms of the DR estimator, we derive the variance:
(10) $\mathbb{V}_{\mathbf{O}}[\mathcal{E}_{\mathrm{DR}}] = \frac{1}{|\mathcal{D}|^{2}} \sum_{(u,i) \in \mathcal{D}} \mathbb{V}_{o_{u,i}}\!\left[ \hat{e}_{u,i} + \frac{o_{u,i} (e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} \right] = \frac{1}{|\mathcal{D}|^{2}} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i} (1 - p_{u,i})\, \delta_{u,i}^{2}}{\hat{p}_{u,i}^{2}}$
∎
Similarly, we can derive the variance of the IPS estimator as
(11) $\mathbb{V}_{\mathbf{O}}[\mathcal{E}_{\mathrm{IPS}}] = \frac{1}{|\mathcal{D}|^{2}} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i} (1 - p_{u,i})\, e_{u,i}^{2}}{\hat{p}_{u,i}^{2}}$
Theorem 3.2 and Equation 11 illustrate that the variance of both estimators depends on the estimated propensity, i.e., the predicted CTR $\hat{p}_{u,i}$, which may lead to a high variance problem. However, it is worth noting that the DR estimator still reduces the variance of the IPS estimator if every event satisfies $\delta_{u,i}^{2} < e_{u,i}^{2}$, i.e., the imputed error approximates the true prediction error better than zero does.
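The closed-form variance in Theorem 3.2 can be checked numerically with a small Monte Carlo simulation over resampled click matrices (a sketch with toy values of our own choosing; the paper does not include one):

```python
import numpy as np

def dr_variance(e, e_hat, p, p_hat):
    # Analytic variance of the DR estimator, taking the expectation over
    # the click indicators o_{u,i} ~ Bern(p_{u,i}).
    D = e.size
    return (p * (1 - p) * (e - e_hat) ** 2 / p_hat ** 2).sum() / D ** 2

rng = np.random.default_rng(0)
e = np.array([0.4, 0.9, 0.2, 0.7])       # true prediction errors (toy values)
e_hat = np.array([0.5, 0.6, 0.2, 1.0])   # imputed errors
p = np.array([0.3, 0.1, 0.6, 0.2])       # true propensities
p_hat = p.copy()                          # assume accurate propensity estimates

# Monte Carlo: resample the click matrix many times, recompute the DR loss.
O = (rng.random((200_000, 4)) < p).astype(float)
dr_samples = (e_hat + O * (e - e_hat) / p_hat).mean(axis=1)
```

The empirical variance of `dr_samples` matches the closed form, and replacing `e_hat` with zeros recovers the larger IPS variance of Equation 11 for these toy values.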
3.2. More Robust Doubly Robust Estimator
The theoretical analysis demonstrates that, despite the double robustness, the DR estimator risks increasing the variance of the IPS estimator under inaccurate error imputation. Hence, we propose a more robust doubly robust (MRDR) estimator for further variance reduction. Specifically, we propose to learn the imputation model of the DR estimator by minimizing its variance. In other words, MRDR is a variant of the DR estimator whose only difference is that its loss function for imputation learning is derived from minimizing DR's variance. As a result, the proposed MRDR estimator not only retains the double robustness, but also achieves a lower variance than the original DR estimator. Based on Equation 9, we take the expectation over the click indicator $o_{u,i}$ and estimate the true propensity $p_{u,i}$ with $\hat{p}_{u,i}$ to derive the loss function of the imputation learning in the MRDR estimator as
(12) $\mathcal{L}_{e}^{\mathrm{MRDR}} = \sum_{(u,i) \in \mathcal{O}} \frac{(1 - \hat{p}_{u,i})\, (\hat{e}_{u,i} - e_{u,i})^{2}}{\hat{p}_{u,i}^{2}}$
Comparing the loss function of imputation learning in MRDR with that in DR, we note that MRDR changes the weights from $\frac{1}{\hat{p}_{u,i}}$ to $\frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^{2}}$, which has the property
(13) $\frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^{2}} > \frac{1}{\hat{p}_{u,i}} \;\text{ if }\; \hat{p}_{u,i} < 0.5, \qquad \frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^{2}} \le \frac{1}{\hat{p}_{u,i}} \;\text{ if }\; \hat{p}_{u,i} \ge 0.5$
As such, the MRDR estimator increases the penalty on clicked events with low propensity, and decreases the penalty on the remaining clicked events. In this way, the imputation model is learned better, which further enables MRDR to reduce the variance of the DR estimator.
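The weight change is easy to express and check directly (a trivial sketch; function names are ours):

```python
def dr_weight(p_hat):
    # Imputation-learning weight used by DR: 1 / p_hat.
    return 1.0 / p_hat

def mrdr_weight(p_hat):
    # Imputation-learning weight used by MRDR: (1 - p_hat) / p_hat^2.
    return (1.0 - p_hat) / p_hat ** 2
```

The two weights coincide exactly at an estimated propensity of 0.5; below that, MRDR's weight is strictly larger, and above it, strictly smaller.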
3.3. Double Learning Approach
In this subsection, we detail the proposed double learning approach for the MRDR estimator. Given a vector $x_{u,i}$ encoding all the features of user $u$ and item $i$, previous works typically introduce two separate models: an imputation model that estimates the imputed errors, and a prediction model that learns from the imputed errors and the true conversion labels to predict the CVR. Here, the imputation model is agnostic of the prediction model, and merely takes the user-item features for error imputation. In other words, during the learning process of the prediction model, the imputed error cannot be dynamically estimated. From an optimization perspective, the imputation model plays the role of estimating the gradients for the learning of the prediction model. However, we argue that simply utilizing model-agnostic methods is not sufficient to approximate such a model-correlated target. To this end, we propose a novel double learning approach, which utilizes the pseudo-labelling technique to provide dynamically changing imputed errors for the prediction model. As such, the complicated error imputation is simplified into a general CVR estimation task. We show the workflow of the double learning approach in Figure 1.
Specifically, we introduce two models with the same structure but different parameters: the prediction model $f_{\theta}$ and the imputation model $g_{\phi}$. When we need to learn the prediction model, we first generate the pseudo label $\tilde{r}_{u,i}$ for each event based on the imputation model. Then, we estimate the imputed error by computing the cross entropy between the predicted conversion rate $\hat{r}_{u,i}$ and the pseudo label $\tilde{r}_{u,i}$, i.e., $\hat{e}_{u,i} = -\tilde{r}_{u,i} \log \hat{r}_{u,i} - (1 - \tilde{r}_{u,i}) \log (1 - \hat{r}_{u,i})$. In this way, the error imputation is converted into a CVR estimation task; furthermore, the original regression problem becomes a binary classification problem. Therefore, we replace the squared loss term in Equation 12 with a cross-entropy term. The imputation learning process of the MRDR estimator is thereby redesigned as
(14) $\mathcal{L}_{\mathrm{imp}}(\phi) = \sum_{(u,i) \in \mathcal{O}} \frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^{2}} \left[ -r_{u,i} \log \tilde{r}_{u,i} - (1 - r_{u,i}) \log (1 - \tilde{r}_{u,i}) \right] + \lambda_{\phi} \lVert \phi \rVert_{2}^{2}$
where $\phi$ denotes all the parameters of the imputation model and $\lambda_{\phi}$ controls the regularization strength to prevent overfitting. Note that, although we change the original form of the loss function of the imputation model in MRDR, the idea of increasing the penalty on low-propensity clicked events and decreasing the penalty on the rest is kept. Meanwhile, we formulate the learning of the prediction model as
(15) $\mathcal{L}_{\mathrm{pred}}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ \hat{e}_{u,i} + \frac{o_{u,i} (e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} \right] + \lambda_{\theta} \lVert \theta \rVert_{2}^{2}$
where $\theta$ denotes all the parameters of the prediction model, $e_{u,i}$ and $\hat{e}_{u,i}$ denote the prediction error and the imputed error, respectively, and $\lambda_{\theta}$ controls the regularization strength to prevent overfitting.
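The two objectives above can be optimized alternately, with the imputation model periodically synced from the prediction model as described in the introduction. A minimal NumPy sketch, using plain logistic-regression models as stand-ins for the two CVR models (all data, names, and hyperparameters here are illustrative assumptions; regularization is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: feature vectors x_{u,i} for n events (all values synthetic).
n, d = 512, 8
X = rng.normal(size=(n, d))
p_hat = np.clip(sigmoid(X @ rng.normal(size=d)), 0.05, 1.0)  # estimated CTRs
o = (rng.random(n) < p_hat).astype(float)                    # click indicators
r = (rng.random(n) < sigmoid(X @ rng.normal(size=d))).astype(float)  # conversions

theta = np.zeros(d)  # prediction model parameters
phi = np.zeros(d)    # imputation model parameters (same structure)
lr = 0.1

for step in range(300):
    # Prediction step: for logistic models the gradient of the DR-style
    # objective reduces to r_hat minus a propensity-corrected target.
    r_hat = sigmoid(X @ theta)
    pseudo = sigmoid(X @ phi)                 # pseudo labels from imputation model
    target = pseudo + o * (r - pseudo) / p_hat
    theta -= lr * (X.T @ (r_hat - target)) / n

    # Imputation step: first sync with the prediction model, then take one
    # MRDR-weighted cross-entropy step on the clicked events.
    phi = theta.copy()
    pseudo = sigmoid(X @ phi)
    w = (1.0 - p_hat) / p_hat ** 2            # MRDR imputation weights
    phi -= lr * (X.T @ (o * w * (pseudo - r))) / n

clicked_ce = -(r * np.log(sigmoid(X @ theta) + 1e-8)
               + (1 - r) * np.log(1 - sigmoid(X @ theta) + 1e-8))[o == 1].mean()
```

Propensities are clipped away from zero to keep the inverse weights bounded; without the clip, the rare-click events dominate the gradients.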
Inspired by Double DQN (doubleDQN), we redesign the learning approach of both models. Generally, we alternate the learning process between the imputation model and the prediction model via mini-batch stochastic gradient descent. As such, the two models regularize each other and jointly reach convergence. Since the MRDR estimator merely enhances the inverse propensity weight of the imputation learning into $\frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^{2}}$, it still suffers from the high variance of the imputation learning, which also occurs in the DR estimator as mentioned in Section 2.4. Therefore, each time before learning the imputation model, we update its parameters with those of the prediction model. In this way, the imputation model is periodically corrected, while the information brought by the enhanced inverse propensity weight is kept. We empirically demonstrate that such a learning scheme is beneficial for alleviating the high variance problem of the imputation learning. We summarize the proposed enhanced doubly robust learning approach, named MRDR-DL, in Algorithm 1.

4. Semi-synthetic Experiments
Following previous works (rat; drjl; drcvr), we conduct semi-synthetic experiments to investigate the following research question (RQ).

Does the MRDR estimator lead to more accurate loss estimation than other estimators?
4.1. Experimental Setup
4.1.1. Dataset and preprocessing
                 ML 100K   Coat Shopping   Yahoo! R3
#users           943       290             15400
#items           1682      300             1000
#MNAR ratings    100000    6960            311704
#MAR ratings     0         4640            54000
To compute the accuracy of an estimated loss, we need a fully observed conversion label matrix, which is unavailable in real-world datasets. Thus, we create a semi-synthetic evaluation dataset from the MovieLens (ML) 100K dataset (movielens) (https://grouplens.org/datasets/movielens/). The statistical details of the dataset are presented in Table 1. We employ the following preprocessing procedure (drcvr) to convert the explicit feedback setting into the post-click conversion setting, and derive a fully observed conversion label matrix and a click indicator matrix.
(1) Use matrix factorization (mf) to complete the rating matrix. Since the predicted ratings are unrealistically high for all user-item pairs, to match a more realistic rating distribution given in (ratio), we sort all the ratings in ascending order, assign a value of 1 to the bottom fraction of the matrix entries, assign a value of 2 to the next fraction, and so on.
(2) Transform the predicted ratings $x_{u,i}$ into CTRs with $p_{u,i} = p_{\max} \cdot \alpha^{\,5 - x_{u,i}}$, where $p_{\max}$ is set to 1 and $\alpha$ is set to 0.5 in our experiments.
(3) Transform the predicted ratings into the true CVR $\gamma_{u,i}$ by replacing each rating with a corresponding conversion rate. Note that in practice we can only observe binary conversion labels rather than the true values of the CVR. Thus, we simply assign fixed CVR values based on the different predicted ratings.
(4) Sample the binary click indicator $o_{u,i}$ and conversion label $r_{u,i}$ with Bernoulli sampling; that is,
(16) $o_{u,i} \sim \mathrm{Bern}(p_{u,i}), \qquad r_{u,i} \sim \mathrm{Bern}(\gamma_{u,i})$
where $\mathrm{Bern}(\cdot)$ denotes the Bernoulli distribution. Thereafter, we can derive a fully observed conversion label matrix $\mathbf{R}$
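Step (4) can be sketched as follows (function and variable names are ours; `ctr` and `cvr` are the synthetic probability matrices from steps (2) and (3)):

```python
import numpy as np

def sample_semi_synthetic(ctr, cvr, seed=0):
    # Bernoulli-sample a click indicator matrix O and a fully observed
    # conversion label matrix R from the synthetic CTR / CVR matrices.
    rng = np.random.default_rng(seed)
    O = (rng.random(ctr.shape) < ctr).astype(int)
    R = (rng.random(cvr.shape) < cvr).astype(int)
    return O, R
```

Resampling with different seeds yields the repeated draws used to average the relative error in Section 4.2.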
and a click indicator matrix $\mathbf{O}$.

4.1.2. Experimental details
Given a predicted CVR matrix $\hat{\mathbf{R}}$, we can directly compute the ideal loss by averaging the prediction error between each entry of $\mathbf{R}$ and the corresponding entry of $\hat{\mathbf{R}}$. In contrast, the estimators derive the estimated loss only from the entries whose corresponding click indicators equal 1. To evaluate the performance of loss estimation, we use the following predicted CVR matrices (rat; drjl) for comparison.

ONE: The predicted conversion rate is identical to the true CVR, except that randomly selected true CVR values of 0.1 are flipped to 0.9.

THREE: Same as ONE, but flipping true CVR of 0.3 instead.

FIVE: Same as ONE, but flipping true CVR of 0.5 instead.

CRS: The predicted conversion rate if the true CVR . Otherwise, .
We compare the MRDR estimator with the naive, EIB, IPS, and DR estimators. We estimate the propensity as , where , and is set to 0.5 to introduce noise. For EIB and DR, the imputed error is computed as . For MRDR, we compute the imputed errors as .
4.1.3. Evaluation metric
We compare the performance of the five estimators using the relative error (RE):
(17) $\mathrm{RE} = \frac{\left| \mathcal{L}_{\mathrm{ideal}} - \mathcal{E} \right|}{\mathcal{L}_{\mathrm{ideal}}}$
where $\mathcal{E}$ denotes the estimated loss of the estimator being compared. RE measures the accuracy of the estimated loss, and a smaller RE means a higher accuracy.
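The metric is a one-liner (function name is ours):

```python
def relative_error(loss_ideal, loss_estimated):
    # RE = |L_ideal - L_est| / L_ideal; smaller means a more accurate
    # estimate of the ideal loss.
    return abs(loss_ideal - loss_estimated) / loss_ideal
```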
4.2. Experiment Results (RQ1)
        naive    EIB      IPS      DR       MRDR
ONE     0.0686   0.5427   0.0346   0.0131   0.0073
THREE   0.0792   0.5869   0.0401   0.0172   0.0047
FIVE    0.1023   0.6152   0.0515   0.0138   0.0119
SKEW    0.0255   0.3574   0.0124   0.0081   0.0013
CRS     0.1773   0.0610   0.0888   0.0551   0.0503
In Table 2, we report the RE of the five estimators averaged over 20 rounds of sampling with Equation 16. We can see that the IPS, DR, and MRDR estimators outperform the naive estimator in every setting. This is caused by the selection bias that we introduce by controlling the propensity $p_{u,i}$. In contrast, the EIB estimator gives the worst RE in four settings, mainly due to the large bias of its heuristic error imputation. Additionally, the DR estimator improves on the IPS estimator by jointly considering the imputed errors and the estimated propensities. Over all the settings, the MRDR estimator achieves the best performance, which can be attributed to both its double robustness and its reduced variance. Overall, the results show that our proposed method achieves more accurate loss estimation. Next, we further evaluate our method on the task of CVR estimation on real-world datasets.
5. Real-world Experiments
                            DCG@K                                         Recall@K
Datasets       Methods      K=2            K=4            K=6            K=2            K=4            K=6
Coat Shopping  Naive        0.6694±0.0136  0.9432±0.0138  1.1321±0.0126  0.8054±0.0159  1.3903±0.0225  1.8991±0.0233
               IPS          0.7093±0.0232  0.9552±0.0223  1.1248±0.0217  0.8249±0.0298  1.3520±0.0353  1.8078±0.0399
               DR-JL        0.6771±0.0273  0.9266±0.0282  1.0962±0.0272  0.7949±0.0337  1.3286±0.0420  1.7849±0.0456
               MRDR-DL      0.7219±0.0211  0.9905±0.0204  1.1696±0.0217  0.8499±0.0265  1.4249±0.0321  1.9060±0.0430
Yahoo! R3      Naive        0.5469±0.0058  0.7466±0.0049  0.8714±0.0040  0.6479±0.0066  1.0745±0.0074  1.4098±0.0062
               IPS          0.5502±0.0018  0.7520±0.0018  0.8751±0.0014  0.6545±0.0021  1.0797±0.0025  1.4168±0.0025
               DR-JL        0.5310±0.0045  0.7273±0.0053  0.8512±0.0045  0.6292±0.0049  1.0495±0.0082  1.3822±0.0081
               MRDR-DL      0.5561±0.0058  0.7549±0.0023  0.8811±0.0036  0.6595±0.0074  1.0846±0.0054  1.4237±0.0059
In this section, we compare the proposed learning approach with existing debiasing approaches on real-world datasets. The experiments are designed to answer the following RQs.

Does the proposed approach MRDR-DL achieve better debiasing performance than existing approaches?

What influence do the various designs have on the proposed approach MRDR-DL?

How does the sample ratio of unclicked events to clicked events influence the performance of MRDR-DL?

How does the proposed double learning approach work for both the imputation model and the prediction model?
5.1. Experimental Setup
5.1.1. Datasets and preprocessing
To evaluate the performance of unbiased CVR estimation, we need an MAR test set. However, as stated in (drali), we cannot force users to randomly click items in order to generate unbiased data for CVR estimation. Previous work (drcvr) simulates the unbiased CVR estimation setting by using datasets with specific properties. First, the datasets need to contain explicit feedback, which can reveal ground-truth user preference information. Second, the datasets need to contain an additional MAR test set, in which users are asked to rate randomly selected sets of items. This enables us to evaluate the performance of unbiased CVR estimation. To the best of our knowledge, there are only two publicly available datasets that satisfy these requirements, i.e., Coat Shopping (https://www.cs.cornell.edu/~schnabts/mnar) and Yahoo! R3 (http://webscope.sandbox.yahoo.com/). The statistical details of both datasets are presented in Table 1.
For both the MNAR data and the MAR data of both datasets, we follow (drcvr) and employ the following preprocessing procedure.
(1) We define the binary click indicator as $o_{u,i} = 1$ if item $i$ is rated by user $u$, and $o_{u,i} = 0$ otherwise.
(2) We define the binary conversion label as $r_{u,i} = 1$ if item $i$ is rated greater than or equal to 4 by user $u$, and $r_{u,i} = 0$ otherwise.
(3) We derive the post-click conversion dataset from the resulting click indicators and conversion labels.
For both datasets, we randomly split the MNAR data into training (90%) and validation (10%) sets, while the MAR data is kept as the test set. Following previous works (ipsimplicit; drcvr), we filter out users who have no conversion records in the test set.
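The preprocessing steps above can be sketched as follows (a hypothetical NumPy sketch; `ratings` is a user-by-item matrix that uses 0 to mark unrated entries):

```python
import numpy as np

def to_post_click(ratings, threshold=4):
    # o_{u,i} = 1 iff the item is rated (any positive rating);
    # r_{u,i} = 1 iff the rating is at least the conversion threshold.
    O = (ratings > 0).astype(int)
    R = (ratings >= threshold).astype(int)
    return O, R
```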
5.1.2. Baselines
We compare the proposed method with the following baselines:

Naive: It simply uses the naive estimator as the loss function to estimate CVR.

IPS (rat): It derives the IPS estimator as the loss function by estimating the CTR as the propensity score.

DR-JL (drjl): It utilizes the DR estimator by jointly learning the imputation model and the prediction model.
Due to its high bias, the EIB estimator is widely recognized as a weak baseline (drjl; drcvr; rat), and thus is not included in our comparison. In our experiments, both the CTR and the CVR are estimated with factorization machines (FM) (fm).
5.1.3. Experimental Protocols
We adopt mini-batch Adam to optimize all the methods, with the default learning rate set to 0.001. We fix the mini-batch size to 1024 for both datasets. For FM, the embedding size is fixed to 64. We tune the regularization coefficient in the range of . Note that, for DR based methods, we apply a grid search when tuning the regularization coefficients of the imputation model and the prediction model; also, the sample ratio of unclicked events to clicked events is tuned in the range of . For CTR estimation, we fix the negative sampling ratio to 4.
For all the methods, we first choose the best hyperparameters based on the validation set. We then apply an early stopping strategy (training stops if the model performance does not improve for five epochs) and report the test result of the model that performs best on the validation set.
We use recall and discounted cumulative gain (DCG) to evaluate the debiasing performance of all the methods. We calculate both metrics for each user in the test set and report the average score.
5.2. Overall Performance (RQ2)
Table 3 shows the overall performance in terms of DCG@K and Recall@K (K = 2, 4, 6) on the two real-world datasets. To reduce the effect of randomness, we repeat the experiments 100 times for Coat Shopping and 20 times for Yahoo! R3, and report the mean and standard deviation of each metric. The best results are highlighted in boldface. From the table, we can see that the debiasing methods IPS and MRDR-DL outperform Naive on both datasets, demonstrating the necessity of handling the selection bias in CVR estimation. Meanwhile, we find that although DR-JL utilizes the unbiased DR estimator, it still gives the worst performance on both datasets. One possible explanation is that DR-JL was originally designed for debiasing explicit MNAR feedback; as such, its joint learning approach may not be directly applicable to CVR estimation. Overall, the proposed method MRDR-DL consistently outperforms the other methods on both datasets, which verifies the effectiveness of both the proposed MRDR estimator and the double learning approach.
5.3. Ablation Study (RQ3)
                                DCG@K                                   Recall@K
Datasets       Methods          K=2     K=3     K=4     K=5     K=6     K=2     K=3     K=4     K=5     K=6
Coat Shopping  MRDR-DL          0.7219  0.8728  0.9905  1.0878  1.1695  0.8499  1.1518  1.4249  1.6765  1.9060
               DR-DL            0.7205  0.8670  0.9806  1.0778  1.1601  0.8438  1.1368  1.4004  1.6517  1.8827
               MRDR-JL          0.6948  0.8442  0.9613  1.0582  1.1442  0.8227  1.1215  1.3935  1.6439  1.8852
               MRDR-DL with SL  0.7255  0.8720  0.9871  1.0827  1.1651  0.8504  1.1434  1.4107  1.6580  1.8892
Yahoo! R3      MRDR-DL          0.5561  0.6694  0.7549  0.8234  0.8811  0.6595  0.8860  1.0846  1.2616  1.4237
               DR-DL            0.5463  0.6602  0.7459  0.8145  0.8714  0.6484  0.8762  1.0752  1.2525  1.4123
               MRDR-JL          0.5546  0.6668  0.7544  0.8221  0.8786  0.6584  0.8828  1.0862  1.2612  1.4199
               MRDR-DL with SL  0.5321  0.6439  0.7287  0.7963  0.8538  0.6298  0.8535  1.0503  1.2251  1.3863
To apply the DR estimator to the post-click conversion setting, the proposed method MRDR-DL has several specific design features. In this section, we analyze their respective impacts on performance via an ablation study. The experimental results for MRDR-DL and its three variants on the two datasets are summarized in Table 4. The results that are better than those of MRDR-DL are highlighted in boldface. We detail the variants and analyze their respective effects as follows.
(1) DR-DL: We replace the MRDR estimator with the DR estimator, i.e., we change the weights of the imputation learning from $\frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^{2}}$ back to $\frac{1}{\hat{p}_{u,i}}$. The results imply that enhancing the weights to adjust the penalty on clicked events according to their propensities is conducive to the variance reduction of the DR estimator, and further improves the performance of the prediction model.
(2) MRDR-JL: We alternate the training of the imputation model and the prediction model without periodically sharing the parameters (i.e., we skip Step 3 of Algorithm 1). The experimental results verify the necessity of periodically correcting the imputation model with the prediction model, which is empirically beneficial for alleviating the high variance problem of the imputation learning.
(3) MRDR-DL with SL: We replace the cross-entropy term of the imputation learning with the squared loss term, which is theoretically derived from the variance of the DR estimator. This variant performs comparably to MRDR-DL on Coat Shopping but significantly worse than MRDR-DL on Yahoo! R3. One reason is that the squared loss aims at minimizing the deviation between imputed errors and true prediction errors, whereas the pseudo-label generation is essentially a binary classification problem. Hence, it is more suitable to adopt the cross entropy as the optimization goal.
5.4. Parameter Sensitivity Study (RQ4)
By jointly considering both clicked and unclicked events, DR based estimators enjoy the double robustness. To investigate the impact of the unclicked events on the proposed MRDR-DL method, we vary the sample ratio of unclicked events to clicked events in the range of {0, 2, 4, 6, 8, All}. Here, "All" means that the sample ratio is set to the maximum possible value, which is 12.5 for Coat Shopping and 49.4 for Yahoo! R3. Figure 2 shows DCG@K and Recall@K for MRDR-DL with respect to different sample ratios on both datasets. As we can see, MRDR-DL with a sample ratio of 0 (i.e., sampling only from the clicked events) gives the worst performance in most settings. This shows that a well-learned imputation model enables the unclicked events to provide the prediction model with useful information. Furthermore, we find that sampling from all the events actually hurts the performance of the prediction model, even though in theory more unclicked events should help. One reason might be that clicked events are typically sparse in real-world datasets, so the prediction model cannot obtain sufficient information from them when they are overwhelmed by unclicked events. For both datasets, the optimal sample ratio lies around 4 to 8; setting the sample ratio too conservatively or too aggressively may hurt the prediction performance.
5.5. Analysis of the Double Learning Approach (RQ5)
In this subsection, we further investigate the proposed double learning approach. We plot the training curves of the prediction model and the imputation model of MRDR-DL on Coat Shopping in Figure 3. In the proposed method MRDR-DL, the prediction model aims at estimating CVR, while the imputation model aims at computing the imputed errors. Hence, we adopt DCG@4 and the mean absolute error (MAE) between imputed errors and true prediction errors, respectively, to evaluate their testing performance. As shown in Figure 3, the training loss of the prediction model slightly fluctuates in the first 300 epochs before gradually reaching convergence. In contrast, the training curve of the imputation model is more stable. The reason is that, at the very beginning, the imputation model is not well trained enough to provide the prediction model with sufficiently accurate information, whereas the imputation model itself is trained on clicked events with ground-truth labels and is therefore more stable. Further epochs of the double learning approach enable both models to exchange their information periodically. In this way, both models are jointly well learned, reaching convergence together after about 900 epochs. Note that the testing curve of the prediction model fluctuates during training. This is reasonable because we train it with a pointwise loss (i.e., cross entropy), whereas we evaluate its debiasing performance with a listwise metric (i.e., DCG@4).
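The alternating schedule of the double learning approach can be sketched as below. The `Dummy` class and its methods are stand-ins invented for illustration; the real Steps 1-3 refer to Algorithm 1 in the paper, and the epoch counts mirror the convergence behavior described above only loosely.

```python
class Dummy:
    # Stand-in for a model that only counts its update calls;
    # a real implementation would perform gradient steps here.
    def __init__(self):
        self.steps = 0
        self.corrections = 0

    def fit_step(self):
        self.steps += 1

    def correct_with(self, other):
        self.corrections += 1

def double_learning(pred_model, imp_model, epochs, sync_every):
    # Alternate Steps 1 and 2 every epoch; apply the periodic
    # correction (Step 3) every `sync_every` epochs, so the two
    # models exchange information throughout training.
    for epoch in range(epochs):
        imp_model.fit_step()    # Step 1: update imputation model on clicked events
        pred_model.fit_step()   # Step 2: update prediction model with imputed errors
        if (epoch + 1) % sync_every == 0:
            imp_model.correct_with(pred_model)  # Step 3: periodic correction

pred, imp = Dummy(), Dummy()
double_learning(pred, imp, epochs=900, sync_every=100)
print(pred.steps, imp.corrections)  # 900 update steps, 9 corrections
```

Skipping the `correct_with` call recovers the MRDR-JL variant from the ablation study, which is what makes the periodic correction easy to isolate experimentally.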
6. Related Work
6.1. General Approaches to CVR Estimation
CVR estimation is a key component of recommender systems because it directly contributes to the final revenue. Due to the inherent similarity, CVR estimation typically borrows the advances made in the CTR prediction and implicit recommendation tasks in practice, such as traditional models (ydpoi; ctrfm), deep-learning-based models (ctrdeepfm; ctrdin; deepctrhfm; wwdl), and reinforcement-learning-based models (lxrl1; lxrl2; lxrl3). However, few studies directly investigate the CVR estimation task. Previous works often employ traditional models such as logistic regression (cvrlr1; cvrlr2) and gradient boosting decision trees (gbdt), while deep learning techniques like neural networks (esmm; esm2) and graph convolutional networks (gmcm; ydgcn) are also adopted for CVR estimation. However, the selection bias issue remains underexplored, despite its significant influence on performance in practice.
6.2. Counterfactual Learning from MNAR Data
Most data for learning recommender systems are MNAR, which is caused by various biases, including selection bias, conformity bias, exposure bias, etc. (survey). Previous works typically adopt counterfactual learning methods to address these issues. Specifically, EIB methods (pmfdebias; eib) rely on a missing-data model to capture the missing mechanism. IPS methods employ logistic regression (rat), the expectation-maximization algorithm (wsdm21), and matrix completion (1bitmc) to estimate the propensities for correcting the mismatch between observed and unobserved data. DR methods (drali; drjl) utilize an imputation model and a prediction model to jointly learn from the MNAR data. Other methods based on the information bottleneck (cvib), meta learning (at), and causal embedding (cause; dice) have also been explored to address these biases. Among the above, IPS and DR have been widely applied to recommender systems. However, how to specify an appropriate error imputation and propensity estimation is a critical issue affecting their unbiasedness, which needs to be resolved in the post-click conversion setting.
6.3. Selection Bias in CVR Estimation
Selection bias is ubiquitous in recommender systems, especially in the CVR estimation task. A few studies have investigated it, achieving effective results. ESMM (esmm) models both the CTR and CVR tasks, using multi-task learning to eliminate the selection bias issue in a heuristic way. Similarly, (esm2), which is also essentially biased, extends ESMM by introducing additional auxiliary tasks. In contrast, GMCM (gmcm) uses the IPS estimator to derive an unbiased error evaluation when learning the CVR estimation task. In addition, Multi-IPW and Multi-DR (drali) enjoy the unbiasedness of the IPS and DR estimators by learning both CTR and CVR tasks through multi-task learning. Although they consider the selection bias, the above approaches are evaluated on biased datasets; thus, their experimental results cannot be used to verify their debiasing performance, which is a widely recognized limitation in practice. Furthermore, a recent work (drcvr) proposes to utilize the DR estimator for debiasing ranking metrics with post-click conversions, which mainly concerns the evaluation of recommender systems. In contrast, we focus on debiasing the learning of the CVR estimation, and we use two real-world datasets containing unbiased data to evaluate the debiasing performance.
7. Conclusion and Future Work
In this paper, we explore the problem of selection bias in post-click conversion rate estimation. First, we analyze the bias and the variance of the DR estimator. Then, based on the theoretical analysis, we propose the More Robust Doubly Robust (MRDR) estimator, which reduces the variance of the DR estimator while retaining the double robustness. Finally, we propose a novel double learning approach for the MRDR estimator. It can dynamically utilize the information of the prediction model for the imputation model and empirically eliminate the high-variance problem of the imputation learning. In the experiments, we verify the effectiveness of the proposed MRDR estimator on semi-synthetic datasets. In addition, we conduct extensive experiments on two real-world datasets to demonstrate the superiority of the proposed debiasing approach. For future work, we believe that the explainability (taert) of the debiasing approach warrants further investigation.