Enhanced Doubly Robust Learning for Debiasing Post-click Conversion Rate Estimation

05/28/2021 ∙ by Siyuan Guo, et al. ∙ Jilin University

Post-click conversion, as a strong signal indicating the user preference, is salutary for building recommender systems. However, accurately estimating the post-click conversion rate (CVR) is challenging due to the selection bias, i.e., the observed clicked events usually happen on users' preferred items. Currently, most existing methods utilize counterfactual learning to debias recommender systems. Among them, the doubly robust (DR) estimator has achieved competitive performance by combining the error imputation based (EIB) estimator and the inverse propensity score (IPS) estimator in a doubly robust way. However, inaccurate error imputation may make its variance even higher than that of the IPS estimator. Worse still, existing methods typically use simple model-agnostic methods to estimate the imputation error, which is not sufficient to approximate the dynamically changing, model-correlated target (i.e., the gradient direction of the prediction model). To solve these problems, we first derive the bias and variance of the DR estimator. Based on this analysis, we propose a more robust doubly robust (MRDR) estimator that further reduces the variance while retaining the double robustness. Moreover, we propose a novel double learning approach for the MRDR estimator, which converts the error imputation into general CVR estimation. We also empirically verify that the proposed learning scheme further eliminates the high variance problem of the imputation learning. To evaluate its effectiveness, extensive experiments are conducted on a semi-synthetic dataset and two real-world datasets. The results demonstrate the superiority of the proposed approach over the state-of-the-art methods. The code is available at https://github.com/guosyjlu/MRDR-DL.


1. Introduction

E-commerce recommender systems aim not only at helping users explore items of interest, but also at increasing revenue for the platform. Therefore, estimating the post-click conversion rate (CVR), i.e., the probability of an item being purchased after it is clicked, is a crucial task for building such systems in practice. Moreover, post-click conversion feedback has been widely recognized as a strong signal for learning recommender systems, as it explicitly expresses the user preference and directly contributes to the gross merchandise volume (GMV) of the platform (gmv; dr-cvr). However, it is very challenging to model such signals, which are extremely sparse in real-world applications. In particular, post-click conversion feedback can only be observed for clicked events, which make up a tiny fraction of all possible user behaviors, while the conversion feedback for unclicked events is missing. As such, a fundamental problem of CVR estimation is to study the missing mechanism in the post-click conversion feedback.

For simplification, conventional CVR models usually assume that the missing conversion feedback is missing-at-random (MAR). Such an assumption can barely hold under the selection bias, and recent studies (rat; ips-implicit-learn; dr-ali) have shown that a recommendation model built on the MAR assumption often leads to sub-optimal results. On real-world e-commerce platforms, since users are free to click the items that they are likely to want to purchase (i.e., user self-selection), the observed clicked events are not representative samples of all the events, which makes the missing conversions missing-not-at-random (MNAR). In other words, the fundamental reason behind the selection bias is that the users' propensities vary from item to item. Here, the propensity is defined as the probability of an item being clicked by a user, i.e., the click-through rate (CTR). Hence, in this paper, we adopt the MNAR assumption when estimating the post-click conversion rate, and focus on addressing the selection bias problem.

In recent years, three unbiased estimators in counterfactual learning have been applied to debiasing the CVR estimation.

(1) The error imputation based (EIB) estimator (pmf-debias; eib) computes an imputed error, i.e., the estimated value of the prediction error, for each unclicked event, and then uses it to estimate the true prediction error over all the events. However, this estimator often has a large bias due to inaccurate error imputation, which easily misleads the CVR estimation. (2) The inverse propensity score (IPS) estimator (rat; gmcm; esmm) inversely weights the prediction error of each clicked event with its estimated CTR to correct the mismatch between the distributions of the clicked and unclicked events. Although this estimator is unbiased given the ground-truth CTRs, it typically suffers from a high variance problem, which can lead to sub-optimal results. (3) The doubly robust (DR) estimator (drjl; dr-ali; dr) combines the EIB estimator and the IPS estimator to achieve both low variance and low bias. Its unbiasedness is guaranteed if either the imputed errors or the CTRs are accurate. This property is called the double robustness.

Among the aforementioned estimators, the DR estimator has achieved initial success for debiasing recommender systems (drjl; dr-ali; dr-cvr). However, two inherent challenges remain. First, despite the double robustness, the DR estimator may increase the variance of the IPS estimator under inaccurate error imputation, which makes the learning process even more complicated and leads to sub-optimal results. Hence, further variance reduction for DR approaches deserves investigation. Second, although the DR estimator is more robust than the EIB and IPS estimators, it still requires relatively accurate CTR estimation and error imputation. Of these two tasks, the former has been extensively investigated by many works (ctr-deepfm; ctr-din), whereas the latter has rarely been studied. To estimate the imputed errors, previous DR based approaches typically introduce an extra imputation model that is agnostic of the prediction model, such as linear regression (dr), matrix factorization (drjl), or a multilayer perceptron (MLP) (dr-ali). Here, the imputed errors, which serve as the gradient directions of the prediction model, should change dynamically during its learning process. However, simply using model-agnostic methods is not sufficient to approximate such a model-correlated target. Thus, a better solution for modeling the error imputation is still needed.

To address the above-mentioned challenges, we propose an enhanced doubly robust learning approach for debiasing post-click conversion rate estimation. To tackle the first challenge, we propose to reduce the variance of the DR estimator by redesigning the goal of the imputation learning (i.e., the learning process of the imputation model) as the minimization of that variance (mrdr; mrdr-init). Specifically, we derive the bias and variance of the DR estimator, based on which we propose the more robust doubly robust (MRDR) estimator, a variant of the DR estimator that achieves a lower variance while retaining the double robustness. Moreover, inspired by Double DQN (doubleDQN) in reinforcement learning, we propose a novel double learning approach for the MRDR estimator to tackle the second limitation. In particular, we adopt two CVR models with the same structure but different parameters. The first one serves as the prediction model, which learns from both the imputed errors and the true prediction errors for the final CVR estimation. The second one serves as the imputation model, which generates a pseudo label, i.e., its predicted CVR, for each event. During the learning of the prediction model, the imputed error can then be directly computed from the pseudo label and the predicted CVR. As such, we convert the error imputation into general CVR estimation, and the imputed errors can be dynamically estimated in a model-correlated way. For the learning of both models, we alternate their learning processes so that they mutually regularize each other. In addition, we periodically update the parameters of the imputation model with the parameters of the prediction model, which is empirically beneficial for eliminating the high variance problem of the imputation learning. Extensive experiments are conducted on both semi-synthetic and real-world datasets to verify the effectiveness of both the proposed MRDR estimator and the double learning approach.

The main contributions of this work are summarized as follows.

  • We conduct theoretical analysis on the bias and variance of the DR estimator, based on which we propose the more robust doubly robust (MRDR) estimator. It can achieve further variance reduction while retaining the double robustness.

  • To dynamically utilize the information of the prediction model for error imputation, we propose a novel double learning approach for the MRDR estimator, which is also empirically beneficial for addressing the high variance problem of the imputation learning.

  • Experimental results on the semi-synthetic dataset empirically verify the effectiveness of the proposed MRDR estimator. Furthermore, we conduct extensive experiments on two real-world datasets. The results show that the proposed enhanced doubly robust learning approach MRDR-DL outperforms the state-of-the-art methods.

2. Preliminaries

In this section, we detail the problem formulation, and introduce some existing unbiased estimators in the post-click conversion setting.

2.1. Problem Formulation

Let $\mathcal{U}$ be the set of users, $\mathcal{I}$ be the set of items, and $\mathcal{D} = \mathcal{U} \times \mathcal{I}$ be the set of all user-item pairs. We denote $\mathbf{R} \in \{0,1\}^{|\mathcal{U}| \times |\mathcal{I}|}$ as the conversion label matrix, where each entry $r_{u,i}$ indicates whether a conversion action occurs after user $u$ clicks item $i$. We use $\hat{\mathbf{R}}$ to represent the predicted conversion rate matrix, where $\hat{r}_{u,i}$ represents the conversion rate predicted by a model. If we had a fully observed conversion label matrix $\mathbf{R}$, the ideal loss function for minimization could be formulated as

(1)  $\mathcal{L}_{ideal} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} e_{u,i},$

where $e_{u,i}$ is the prediction error. We usually adopt the cross entropy, $e_{u,i} = -r_{u,i}\log \hat{r}_{u,i} - (1 - r_{u,i})\log(1 - \hat{r}_{u,i})$, as the optimization goal for binary classification. Let $\mathbf{O} \in \{0,1\}^{|\mathcal{U}| \times |\mathcal{I}|}$ be the click indicator matrix with each entry $o_{u,i} = 1$ if user $u$ clicks item $i$, and 0 otherwise. Since only post-click conversions for clicked events can be observed, the naive estimator estimates the ideal loss function by averaging the prediction errors of the clicked events as

(2)  $\mathcal{E}_{naive} = \frac{1}{|\mathcal{O}|} \sum_{(u,i) \in \mathcal{O}} e_{u,i},$

where $\mathcal{O} = \{(u,i) \mid (u,i) \in \mathcal{D},\ o_{u,i} = 1\}$ denotes the clicked events. The naive estimator is intuitive and widely adopted by many existing methods. However, due to the selection bias, the conversions for unclicked events are MNAR, which leads to a biased estimation, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{naive}] \neq \mathcal{L}_{ideal}$. Previous works (rat; drjl; ips-implicit-learn) have proved that learning based on a biased estimator often leads to a sub-optimal prediction model. Hence, it is essential to develop an unbiased estimator to address the MNAR problem. In the following, we introduce three existing unbiased estimators.
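To make the notation concrete, the following sketch (our own illustration rather than code from the paper; the toy sizes and array names are assumptions) computes the ideal loss of Equation 1 and the naive estimator of Equation 2 with NumPy.

```python
import numpy as np

def cross_entropy(r, r_hat, eps=1e-8):
    """Element-wise prediction error e_{u,i} (binary cross entropy)."""
    r_hat = np.clip(r_hat, eps, 1.0 - eps)
    return -(r * np.log(r_hat) + (1.0 - r) * np.log(1.0 - r_hat))

def ideal_loss(r, r_hat):
    """Equation (1): average error over all user-item pairs (requires fully observed r)."""
    return cross_entropy(r, r_hat).mean()

def naive_estimator(r, r_hat, o):
    """Equation (2): average error over clicked events only (biased under MNAR)."""
    e = cross_entropy(r, r_hat)
    return e[o == 1].mean()

# Toy example: 4 users x 3 items (sizes are arbitrary).
rng = np.random.default_rng(0)
r = rng.integers(0, 2, size=(4, 3))            # conversion labels
r_hat = rng.uniform(0.05, 0.95, size=(4, 3))   # predicted CVRs
o = rng.integers(0, 2, size=(4, 3))            # click indicators
print(ideal_loss(r, r_hat), naive_estimator(r, r_hat, o))
```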

2.2. Error Imputation Based Estimator

The error imputation based (EIB) estimator introduces an imputation model to compute the imputed error $\hat{e}_{u,i}$, i.e., the estimated value of the prediction error $e_{u,i}$ (eib; pmf-debias). Leveraging the imputed errors for unclicked events and the prediction errors for clicked events, the EIB estimator estimates the ideal loss function as

(3)  $\mathcal{E}_{EIB} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ o_{u,i}\, e_{u,i} + (1 - o_{u,i})\, \hat{e}_{u,i} \right].$

When the imputed error is accurate for any given unclicked event, the EIB estimator is unbiased, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{EIB}] = \mathcal{L}_{ideal}$. However, accurate error imputation is hard to achieve in practice, so the EIB estimator often has a large bias, which can easily mislead the learning of the prediction model.

2.3. Inverse Propensity Score Estimator

The inverse propensity score (IPS) estimator (rat; ips-implicit-learn; gmcm) weights each clicked event with the inverse of its propensity $p_{u,i}$, where the propensity refers to the probability of item $i$ being clicked by user $u$, i.e., the click-through rate (CTR) in the post-click conversion setting. By introducing an auxiliary CTR task to estimate the propensity as $\hat{p}_{u,i}$, the IPS estimator can be formulated as

(4)  $\mathcal{E}_{IPS} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}\, e_{u,i}}{\hat{p}_{u,i}}.$

The IPS estimator derives an unbiased estimate of the ideal loss function, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{IPS}] = \mathcal{L}_{ideal}$, when the estimated propensity is accurate for any given clicked event. However, as the clicked events account for only a small part of $\mathcal{D}$, the estimated CTR $\hat{p}_{u,i}$ is typically small. Hence, the IPS estimator suffers from an especially severe high variance problem.

2.4. Doubly Robust Estimator

To address the large bias problem of the EIB estimator and the high variance problem of the IPS estimator, the doubly robust (DR) estimator is adopted by many previous works (dr; drjl; dr-ali). It combines the EIB estimator and the IPS estimator in a doubly robust way. In particular, this estimator uses the imputed errors to estimate the prediction errors for all the events, and corrects the error deviation on the clicked events, where the deviation is inversely weighted by the propensity to eliminate the MNAR effect. The loss function of the DR estimator can be defined as

(5)  $\mathcal{E}_{DR} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ \hat{e}_{u,i} + \frac{o_{u,i}\,(e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} \right].$

The DR estimator is unbiased, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{DR}] = \mathcal{L}_{ideal}$, if either the imputed error of any event or the estimated propensity of any clicked event is accurate. This property is recognized as double robustness. To compute the imputed errors, previous works typically introduce a separate imputation model. Since the imputation learning is actually a regression problem, DR uses the inverse-propensity-weighted squared loss

(6)  $\mathcal{L}_{e}^{DR} = \sum_{(u,i) \in \mathcal{O}} \frac{(\hat{e}_{u,i} - e_{u,i})^2}{\hat{p}_{u,i}}$

to train the imputation model. The inverse propensity weight accounts for the MNAR effect, but it also leads to the high variance problem of the imputation learning.
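For reference, Equations 3-6 translate directly into a few NumPy one-liners. The sketch below is our own illustration; the error arrays e and e_hat, the propensities p_hat, and the click indicators o are assumed to be given (e.g., e computed with the cross_entropy helper above).

```python
import numpy as np

def eib_estimator(e, e_hat, o):
    """Equation (3): observed errors for clicked events, imputed errors elsewhere."""
    return np.mean(o * e + (1 - o) * e_hat)

def ips_estimator(e, p_hat, o):
    """Equation (4): inverse-propensity-weighted errors of the clicked events."""
    return np.mean(o * e / p_hat)

def dr_estimator(e, e_hat, p_hat, o):
    """Equation (5): imputed errors plus propensity-weighted error deviations."""
    return np.mean(e_hat + o * (e - e_hat) / p_hat)

def dr_imputation_loss(e, e_hat, p_hat, o):
    """Equation (6): inverse-propensity-weighted squared loss over clicked events."""
    return np.sum(o * (e_hat - e) ** 2 / p_hat)
```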

3. Enhanced Doubly Robust Learning Approach

Figure 1. Workflow of the double learning approach.

In this section, we elaborate on the proposed enhanced doubly robust learning approach. We first analyze the bias and variance of the doubly robust estimator, based on which we propose the more robust doubly robust estimator for further variance reduction. Then, we detail the proposed novel double learning approach for the MRDR estimator.

3.1. Bias and Variance Analysis of DR Estimator

Initially, we formulate the bias of the DR estimator to prove its double robustness.

Theorem 3.1.

Let $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$ denote the additive error deviation, and $\Delta_{u,i} = \frac{\hat{p}_{u,i} - p_{u,i}}{\hat{p}_{u,i}}$ the multiplicative propensity deviation. Then, the bias of the DR estimator is

(7)  $\mathrm{Bias}\big[\mathcal{E}_{DR}\big] = \frac{1}{|\mathcal{D}|} \left| \sum_{(u,i) \in \mathcal{D}} \Delta_{u,i}\, \delta_{u,i} \right|.$

Proof.

See Theorem 3.2 in (dr-ali) for the proof. ∎

As shown in Theorem 3.1, the DR estimator matches the ideal loss function in expectation, i.e., $\mathbb{E}_{\mathbf{O}}[\mathcal{E}_{DR}] = \mathcal{L}_{ideal}$, if either $\delta_{u,i} = 0$ or $\Delta_{u,i} = 0$ holds for every event, whereas the EIB estimator requires $\delta_{u,i} = 0$ and the IPS estimator requires $\Delta_{u,i} = 0$. This property is called double robustness. Next, we derive the variance of the DR estimator.

Theorem 3.2.

The variance of the DR estimator is

(8)  $\mathbb{V}_{\mathbf{O}}\big[\mathcal{E}_{DR}\big] = \frac{1}{|\mathcal{D}|^2} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i}\,(1 - p_{u,i})\, \delta_{u,i}^2}{\hat{p}_{u,i}^2}.$

Proof.

For a single term of the DR estimator, its variance with respect to the click indicator $o_{u,i}$ is

(9)  $\mathbb{V}_{o_{u,i}}\!\left[ \hat{e}_{u,i} + \frac{o_{u,i}\,\delta_{u,i}}{\hat{p}_{u,i}} \right] = \frac{\delta_{u,i}^2}{\hat{p}_{u,i}^2}\, \mathbb{V}_{o_{u,i}}[o_{u,i}] = \frac{p_{u,i}\,(1 - p_{u,i})\, \delta_{u,i}^2}{\hat{p}_{u,i}^2}.$

Then, since the click indicators are independent across events, summing over all terms of the DR estimator yields the variance:

(10)  $\mathbb{V}_{\mathbf{O}}\big[\mathcal{E}_{DR}\big] = \frac{1}{|\mathcal{D}|^2} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i}\,(1 - p_{u,i})\, \delta_{u,i}^2}{\hat{p}_{u,i}^2}. \quad \blacksquare$

Similarly, we can derive the variance of the IPS estimator as

(11)  $\mathbb{V}_{\mathbf{O}}\big[\mathcal{E}_{IPS}\big] = \frac{1}{|\mathcal{D}|^2} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i}\,(1 - p_{u,i})\, e_{u,i}^2}{\hat{p}_{u,i}^2}.$

Theorem 3.2 and Equation 11 show that the variance of both estimators depends on the estimated propensity, i.e., the predicted CTR $\hat{p}_{u,i}$, which may lead to a high variance problem. However, it is worth noting that the DR estimator still reduces the variance of the IPS estimator if every event satisfies $\delta_{u,i}^2 \leq e_{u,i}^2$, i.e., $0 \leq \hat{e}_{u,i} \leq 2 e_{u,i}$.
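These closed-form expressions can be checked numerically. The following sketch (our own sanity check, not part of the paper) simulates repeated draws of the click indicators and compares the empirical variance of the DR estimator against Equation 8.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200                                    # number of user-item pairs (toy size)
p = rng.uniform(0.05, 0.5, size=n)         # true propensities p_{u,i}
p_hat = np.clip(p * rng.uniform(0.7, 1.3, size=n), 0.01, 1.0)  # misspecified propensities
e = rng.uniform(0.1, 2.0, size=n)          # true prediction errors e_{u,i}
e_hat = e + rng.normal(0.0, 0.3, size=n)   # imputed errors with some deviation
delta = e - e_hat

# Closed-form variance of the DR estimator (Equation 8).
var_theory = np.sum(p * (1 - p) * delta**2 / p_hat**2) / n**2

# Empirical variance over repeated Bernoulli draws of the click indicators.
draws = np.array([
    np.mean(e_hat + (rng.random(n) < p) * delta / p_hat)
    for _ in range(20000)
])
print(var_theory, draws.var())  # the two values should closely agree
```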

3.2. More Robust Doubly Robust Estimator

The theoretical analysis demonstrates that, despite the double robustness, the DR estimator risks increasing the variance of the IPS estimator under inaccurate error imputation. Hence, we propose a more robust doubly robust (MRDR) estimator for further variance reduction. Specifically, we propose to learn the imputation model of the DR estimator by minimizing its variance. In other words, MRDR is a variation of the DR estimator, and the only difference is that its loss function for imputation learning is derived from minimizing DR's variance. This means that the proposed MRDR estimator not only retains the double robustness, but also achieves a lower variance than the original DR estimator. Based on Equation 9, we take the expectation over the click indicators and estimate the unknown $p_{u,i}$ with $\hat{p}_{u,i}$, deriving the loss function of the imputation learning in the MRDR estimator as

(12)  $\mathcal{L}_{e}^{MRDR} = \sum_{(u,i) \in \mathcal{O}} \frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^2}\, (\hat{e}_{u,i} - e_{u,i})^2.$

Comparing the loss function of imputation learning in MRDR with that in DR, we note that MRDR changes the weight of each clicked event from $\frac{1}{\hat{p}_{u,i}}$ to $\frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^2}$, which has the property

(13)  $\frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^2} \geq \frac{1}{\hat{p}_{u,i}} \iff \hat{p}_{u,i} \leq \frac{1}{2}.$

As such, the MRDR estimator increases the penalty of the clicked events with low propensity and decreases the penalty of the remaining clicked events. In this way, the imputation model is learned better, which further enables MRDR to reduce the variance of the DR estimator.
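The effect of the reweighting in Equation 13 is easy to inspect numerically; the short sketch below (our own illustration) compares the DR weight 1/p̂ with the MRDR weight (1-p̂)/p̂² for a few propensity values and shows that the two coincide at p̂ = 0.5.

```python
import numpy as np

p_hat = np.array([0.05, 0.1, 0.3, 0.5, 0.7, 0.9])
dr_weight = 1.0 / p_hat
mrdr_weight = (1.0 - p_hat) / p_hat**2

# MRDR up-weights low-propensity clicked events and down-weights the rest;
# the two weights are equal exactly at p_hat = 0.5.
for p, w_dr, w_mrdr in zip(p_hat, dr_weight, mrdr_weight):
    print(f"p_hat={p:.2f}  DR={w_dr:7.2f}  MRDR={w_mrdr:7.2f}")
```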

3.3. Double Learning Approach

In this subsection, we detail the proposed double learning approach for the MRDR estimator. Given a feature vector $x_{u,i}$ encoding all the features of user $u$ and item $i$, previous works typically introduce two separate models: an imputation model that estimates the imputed errors, and a prediction model that learns from the imputed errors and the true conversion labels to predict the CVR. Here, the imputation model is agnostic of the prediction model and merely takes the user-item features for error imputation. In other words, during the learning process of the prediction model, the imputed errors cannot be dynamically estimated. From an optimization perspective, the imputation model plays the role of estimating the gradients for the learning of the prediction model. However, we argue that simply utilizing model-agnostic methods is not sufficient to approximate such a model-correlated target. To this end, we propose a novel double learning approach, which utilizes the pseudo-labelling technique to provide dynamically changing imputed errors for the prediction model. As such, the complicated error imputation is simplified into a general CVR estimation task. We show the workflow of the double learning approach in Figure 1.

Specifically, we introduce two models with the same structure but different parameters: the prediction model $f_{\theta}$ and the imputation model $f_{\theta'}$. When we need to learn the prediction model, we first generate a pseudo label $\tilde{r}_{u,i} = f_{\theta'}(x_{u,i})$ for each event based on the imputation model. Then, we estimate the imputed error by computing the cross entropy between the predicted conversion rate $\hat{r}_{u,i} = f_{\theta}(x_{u,i})$ and the pseudo label $\tilde{r}_{u,i}$, i.e., $\hat{e}_{u,i} = -\tilde{r}_{u,i}\log \hat{r}_{u,i} - (1 - \tilde{r}_{u,i})\log(1 - \hat{r}_{u,i})$. In this way, the error imputation is converted into a CVR estimation task; further, the original regression problem is converted into a binary classification problem. Therefore, we replace the squared loss term in Equation 12 with a cross-entropy term. The imputation learning process of the MRDR estimator is thereby redesigned as

(14)  $\mathcal{L}_{im}(\theta') = \sum_{(u,i) \in \mathcal{O}} \frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^2} \left[ -r_{u,i}\log \tilde{r}_{u,i} - (1 - r_{u,i})\log(1 - \tilde{r}_{u,i}) \right] + \gamma \|\theta'\|_2^2,$

where $\theta'$ denotes all the parameters of the imputation model and $\gamma$ controls the regularization strength to prevent overfitting. Note that, although we change the original formulation of the loss function of the imputation model in MRDR, the idea of increasing the penalty of the low-propensity clicked events and decreasing the penalty of the rest is kept. Meanwhile, we formulate the learning of the prediction model as

(15)  $\mathcal{L}_{pred}(\theta) = \sum_{(u,i) \in \mathcal{D}} \left[ \hat{e}_{u,i} + \frac{o_{u,i}\,(e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} \right] + \lambda \|\theta\|_2^2,$

where $\theta$, $e_{u,i}$, and $\hat{e}_{u,i}$ denote all the parameters of the prediction model $f_{\theta}$, the prediction error, and the imputed error, respectively, and $\lambda$ controls the regularization strength to prevent overfitting.
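A PyTorch-style sketch of the two losses follows. This is our own reading of Equations 14 and 15, not the authors' implementation: the tensor names, the clamping constant, and the choice to realize the L2 terms through the optimizer's weight decay are assumptions.

```python
import torch
import torch.nn.functional as F

def imputation_loss(r_tilde, r_obs, p_hat):
    """Eq. (14) without the L2 term: MRDR-weighted cross entropy on clicked events.
    r_tilde: imputation model outputs; r_obs: observed labels as float tensors in {0., 1.}."""
    w = (1.0 - p_hat) / p_hat.clamp(min=1e-6) ** 2
    ce = F.binary_cross_entropy(r_tilde, r_obs, reduction="none")
    return (w * ce).sum()

def prediction_loss(r_hat, r_obs, r_tilde, p_hat, o):
    """Eq. (15) without the L2 term: DR-style loss with pseudo-labelled imputed errors.
    The observed error e only matters where o = 1; elsewhere r_obs can be a dummy 0."""
    e = F.binary_cross_entropy(r_hat, r_obs, reduction="none")
    e_hat = F.binary_cross_entropy(r_hat, r_tilde.detach(), reduction="none")
    return (e_hat + o * (e - e_hat) / p_hat.clamp(min=1e-6)).sum()
```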

Inspired by Double DQN (doubleDQN) in reinforcement learning, we redesign the learning approach of both models. Generally, we alternate the learning between the imputation model and the prediction model via mini-batch stochastic gradient descent, so that the two models regularize each other and jointly reach convergence. Since the MRDR estimator merely enhances the inverse propensity weight of the imputation learning from $\frac{1}{\hat{p}_{u,i}}$ to $\frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^2}$, it still suffers from the high variance of the imputation learning, which also occurs in the DR estimator as mentioned in Section 2.4. Therefore, each time before learning the imputation model, we update its parameters with those of the prediction model, i.e., $\theta' \leftarrow \theta$. In this way, the imputation model is periodically corrected, while the information brought by the enhanced inverse propensity weight is kept. We empirically demonstrate that such a learning scheme is beneficial for eliminating the high variance problem of the imputation learning. We summarize the proposed enhanced doubly robust learning approach, named MRDR-DL, in Algorithm 1.

Input: the click indicator matrix $\mathbf{O}$, the observed conversion labels $\{r_{u,i} \mid o_{u,i} = 1\}$, and the estimated propensities $\{\hat{p}_{u,i}\}$
Output: the prediction model parameters $\theta$
1   Initialize the parameters $\theta$ (prediction model) and $\theta'$ (imputation model)
2   while stopping criteria is not satisfied do
3       $\theta' \leftarrow \theta$
4       for number of steps for training the imputation model do
5           Sample a batch of clicked events from $\mathcal{O}$
6           Update $\theta'$ by descending along the gradient $\nabla_{\theta'} \mathcal{L}_{im}(\theta')$
7       end for
8       Generate the pseudo label $\tilde{r}_{u,i}$ for any event $(u,i)$
9       for number of steps for training the prediction model do
10          Sample a batch of events from $\mathcal{D}$ (see the note below)
11          Update $\theta$ by descending along the gradient $\nabla_{\theta} \mathcal{L}_{pred}(\theta)$
12      end for
13  end while
Algorithm 1. The Proposed Enhanced Doubly Robust Learning Approach, MRDR-DL
Note: Due to the sparsity of the clicked events, we decrease the sample probability of the unclicked events in practice when sampling in line 10.
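The sketch below renders Algorithm 1 as a minimal PyTorch training loop under our own assumptions: CVRModel is a hypothetical placeholder (the paper uses an FM), the data loaders and batch handling are simplified, and the loss functions are the ones sketched after Equation 15.

```python
import torch
import torch.nn as nn

class CVRModel(nn.Module):
    """Placeholder CVR model; the paper instantiates it as a factorization machine."""
    def __init__(self, num_features):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return torch.sigmoid(self.net(x)).squeeze(-1)

def train_mrdr_dl(clicked_loader, all_loader, num_features, epochs=100, lr=1e-3):
    pred_model = CVRModel(num_features)   # prediction model (theta)
    imp_model = CVRModel(num_features)    # imputation model (theta')
    opt_pred = torch.optim.Adam(pred_model.parameters(), lr=lr, weight_decay=1e-5)
    opt_imp = torch.optim.Adam(imp_model.parameters(), lr=lr, weight_decay=1e-5)

    for _ in range(epochs):
        imp_model.load_state_dict(pred_model.state_dict())   # Step 3: theta' <- theta
        for x, r_obs, p_hat in clicked_loader:                # train the imputation model
            loss = imputation_loss(imp_model(x), r_obs, p_hat)
            opt_imp.zero_grad()
            loss.backward()
            opt_imp.step()
        for x, r_obs, p_hat, o in all_loader:                 # train the prediction model
            with torch.no_grad():
                r_tilde = imp_model(x)                        # pseudo labels
            loss = prediction_loss(pred_model(x), r_obs, r_tilde, p_hat, o)
            opt_pred.zero_grad()
            loss.backward()
            opt_pred.step()
    return pred_model
```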

4. Semi-synthetic Experiments

Following previous works (rat; drjl; dr-cvr), we conduct semi-synthetic experiments to investigate the following research question (RQ).

  1. Does the MRDR estimator lead to more accurate loss estimation than other estimators?

4.1. Experimental Setup

4.1.1. Dataset and preprocessing

                ML 100K   Coat Shopping   Yahoo! R3
#users              943             290       15400
#items             1682             300        1000
#MNAR ratings    100000            6960      311704
#MAR ratings          0            4640       54000
Table 1. Statistical details of the datasets.

To compute the accuracy of an estimated loss, we need a fully observed conversion label matrix, which is unavailable in real-world datasets. Thus, we create a semi-synthetic evaluation dataset from the MovieLens (ML) 100K dataset (movielens) (https://grouplens.org/datasets/movielens/). The statistical details of the dataset are presented in Table 1. We employ the following preprocessing procedures (dr-cvr) to convert the explicit feedback setting into the post-click conversion setting, and derive a fully observed conversion label matrix and a click indicator matrix.

(1) Use matrix factorization (mf) to complete the rating matrix. Since the predicted ratings are unrealistically high for all user-item pairs, we match a more realistic rating distribution given in (ratio): we sort all the ratings in ascending order, assign a rating of 1 to the bottom fraction of the matrix entries, a rating of 2 to the next fraction, and so on.

(2) Transform the predicted ratings into CTRs $p_{u,i}$; the two parameters of this transformation are set to 1 and 0.5, respectively, in our experiments.

(3) Transform the predicted ratings into true CVRs $\gamma_{u,i}$ by replacing each rating with a corresponding conversion rate. Note that, in practice, we can only observe binary conversion labels rather than the true CVR values; thus, we simply assign fixed CVR values according to the predicted ratings.

(4) Sample the binary click indicators and conversion labels with Bernoulli sampling, i.e.,

(16)  $o_{u,i} \sim \mathrm{Bern}(p_{u,i}), \qquad r_{u,i} \sim \mathrm{Bern}(\gamma_{u,i}),$

where $\mathrm{Bern}(\cdot)$ denotes the Bernoulli distribution. Thereafter, we can derive a fully-observed conversion label matrix $\mathbf{R}$ and a click indicator matrix $\mathbf{O}$.
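A minimal NumPy sketch of step (4), assuming the CTR matrix p and the true CVR matrix gamma have already been constructed in steps (2) and (3):

```python
import numpy as np

def sample_semi_synthetic(p, gamma, seed=0):
    """Equation (16): draw click indicators and conversion labels via Bernoulli sampling."""
    rng = np.random.default_rng(seed)
    o = (rng.random(p.shape) < p).astype(int)            # o_{u,i} ~ Bern(p_{u,i})
    r = (rng.random(gamma.shape) < gamma).astype(int)    # r_{u,i} ~ Bern(gamma_{u,i})
    return o, r
```

Only the conversions of clicked events would be observable in practice; the fully observed matrix is kept here solely for computing the ideal loss during evaluation.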

4.1.2. Experimental details

Given a predicted CVR matrix $\hat{\mathbf{R}}$, we can directly compute the ideal loss by averaging the prediction errors between $\hat{\mathbf{R}}$ and $\mathbf{R}$ over all entries. In contrast, the estimators derive the estimated loss using only the partial entries of $\mathbf{R}$ whose corresponding click indicators equal 1. To evaluate the performance of loss estimation, we use the following five predicted CVR matrices (rat; drjl) for comparison.

  • ONE: The predicted conversion rate $\hat{r}_{u,i}$ is identical to the true CVR $\gamma_{u,i}$, except that randomly selected entries with true CVR 0.1 are flipped to 0.9.

  • THREE: Same as ONE, but flipping entries with true CVR 0.3 instead.

  • FIVE: Same as ONE, but flipping entries with true CVR 0.5 instead.

  • SKEW: The predicted conversion rate is sampled from a Gaussian distribution and clipped to a valid CVR range.

  • CRS: The predicted conversion rate is set to one fixed value if the true CVR is at least a threshold, and to another fixed value otherwise.

We compare the MRDR estimator with the naive, EIB, IPS, and DR estimators. To introduce noise, we estimate the propensities by perturbing the true CTRs, with the noise parameter set to 0.5. For EIB and DR, the imputed errors are computed heuristically from the observed prediction errors, and the imputed errors for MRDR are computed in an analogous way.

4.1.3. Evaluation metric

We compare the performance of the five estimators by the relative error (RE), defined as

(17)  $\mathrm{RE} = \frac{\left| \mathcal{E}_{est} - \mathcal{L}_{ideal} \right|}{\mathcal{L}_{ideal}},$

where $\mathcal{E}_{est}$ denotes the estimator to be compared. RE evaluates the accuracy of the estimated loss, and a smaller RE means a higher accuracy.
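Equation 17 translates directly into code; the function below is our own transcription:

```python
def relative_error(estimated_loss: float, ideal_loss: float) -> float:
    """Equation (17): relative deviation of an estimated loss from the ideal loss."""
    return abs(estimated_loss - ideal_loss) / ideal_loss
```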

4.2. Experiment Results (RQ1)

        naive    EIB      IPS      DR       MRDR
ONE     0.0686   0.5427   0.0346   0.0131   0.0073
THREE   0.0792   0.5869   0.0401   0.0172   0.0047
FIVE    0.1023   0.6152   0.0515   0.0138   0.0119
SKEW    0.0255   0.3574   0.0124   0.0081   0.0013
CRS     0.1773   0.0610   0.0888   0.0551   0.0503
Table 2. RE of the five estimators compared to the ideal loss.

In Table 2, we report the RE of the five estimators averaged over 20 samplings with Equation 16. We can see that the IPS, DR, and MRDR estimators outperform the naive estimator in every setting, which reflects the selection bias that we introduce by controlling the propensities. In contrast, the EIB estimator yields the worst RE in four of the five settings, mainly due to the large bias of its heuristic error imputation. Additionally, the DR estimator improves on the IPS estimator by jointly considering the imputed errors and the estimated propensities. Over all the settings, the MRDR estimator achieves the best performance, which can be attributed to both its double robustness and its reduced variance. Overall, the results show that the proposed estimator achieves more accurate loss estimation. Next, we further evaluate our method on the task of CVR estimation on real-world datasets.

5. Real-world Experiments

Datasets        Methods   DCG@2          DCG@4          DCG@6          Recall@2       Recall@4       Recall@6
Coat Shopping   Naive     0.6694±0.0136  0.9432±0.0138  1.1321±0.0126  0.8054±0.0159  1.3903±0.0225  1.8991±0.0233
                IPS       0.7093±0.0232  0.9552±0.0223  1.1248±0.0217  0.8249±0.0298  1.3520±0.0353  1.8078±0.0399
                DR-JL     0.6771±0.0273  0.9266±0.0282  1.0962±0.0272  0.7949±0.0337  1.3286±0.0420  1.7849±0.0456
                MRDR-DL   0.7219±0.0211  0.9905±0.0204  1.1696±0.0217  0.8499±0.0265  1.4249±0.0321  1.9060±0.0430
Yahoo! R3       Naive     0.5469±0.0058  0.7466±0.0049  0.8714±0.0040  0.6479±0.0066  1.0745±0.0074  1.4098±0.0062
                IPS       0.5502±0.0018  0.7520±0.0018  0.8751±0.0014  0.6545±0.0021  1.0797±0.0025  1.4168±0.0025
                DR-JL     0.5310±0.0045  0.7273±0.0053  0.8512±0.0045  0.6292±0.0049  1.0495±0.0082  1.3822±0.0081
                MRDR-DL   0.5561±0.0058  0.7549±0.0023  0.8811±0.0036  0.6595±0.0074  1.0846±0.0054  1.4237±0.0059
Table 3. A comparison of the overall performance of MRDR-DL and competing methods on two real-world datasets (mean ± standard deviation).

In this section, we compare the proposed learning approach with other existing debiasing approaches using real-world datasets. We anticipate the experimental results to answer the following RQs.

  1. Does the proposed approach MRDR-DL lead to higher debiasing performance than existing approaches?

  2. What influence do the various designs have on the proposed approach MRDR-DL?

  3. How does the sample ratio of unclicked events to clicked events influence the performance of MRDR-DL?

  4. How does the proposed double learning approach work for both the imputation model and the prediction model?

5.1. Experimental Setup

5.1.1. Datasets and preprocessing

To evaluate the performance of the unbiased CVR estimation, we need an MAR test set. However, as stated in (dr-ali), we cannot force users to randomly click items in order to generate unbiased data for CVR estimation. Previous work (dr-cvr) simulates the unbiased CVR estimation setting by using datasets with specific properties. First, the datasets need to contain explicit feedback, which reveals ground-truth user preference information. Next, the datasets need to contain an additional MAR test set, in which users are asked to rate randomly selected sets of items. This enables us to evaluate the performance of the unbiased CVR estimation. To the best of our knowledge, there are only two publicly available datasets that satisfy these requirements, i.e., Coat Shopping (https://www.cs.cornell.edu/~schnabts/mnar) and Yahoo! R3 (http://webscope.sandbox.yahoo.com/). The statistical details of both datasets are presented in Table 1.

For both the MNAR data and the MAR data of both datasets, we follow (dr-cvr) and employ the following preprocessing procedure.

(1) We define the binary click indicator as $o_{u,i} = 1$ if item $i$ is rated by user $u$, and $o_{u,i} = 0$ otherwise.

(2) We define the binary conversion label as $r_{u,i} = 1$ if item $i$ is rated greater than or equal to 4 by user $u$, and $r_{u,i} = 0$ otherwise.

(3) We derive the post-click conversion dataset from the resulting click indicators and conversion labels.
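A sketch of this binarization, assuming the ratings of each dataset are loaded as parallel arrays of user indices, item indices, and rating values (the variable names and the loading step are hypothetical):

```python
import numpy as np

def to_post_click_conversion(user_ids, item_ids, ratings, num_users, num_items):
    """Convert explicit ratings into a click indicator matrix o and conversion labels r."""
    o = np.zeros((num_users, num_items), dtype=int)
    r = np.zeros((num_users, num_items), dtype=int)
    o[user_ids, item_ids] = 1                                       # a rated item counts as clicked
    r[user_ids, item_ids] = (np.asarray(ratings) >= 4).astype(int)  # rating >= 4 => converted
    return o, r
```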

For both datasets, we randomly split the MNAR datasets into training (90%) and validation (10%) sets, while the MAR datasets are kept as the test sets. Following the previous works (ips-implicit; dr-cvr), we filter out users who have no conversion records in the test set.

5.1.2. Baselines

We compare the proposed method with the following baselines:

  • Naive: It simply uses the naive estimator as the loss function to estimate CVR.

  • IPS (rat): It derives the IPS estimator as the loss function by estimating the CTR as the propensity score.

  • DR-JL (drjl): It utilizes the DR estimator by jointly learning the imputation model and prediction model.

Due to the high bias problem, the EIB estimator is widely recognized as a weak baseline (drjl; dr-cvr; rat), and thus is not included in our comparison. In our experiments, both the CTR and the CVR are estimated by factorization machine (FM) (fm).
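For concreteness, a compact second-order FM in PyTorch is sketched below. This is a generic implementation of our own, not the authors' code; the field layout is an assumption, while the 64-dimensional embedding matches the setting reported in Section 5.1.3.

```python
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    """Second-order FM over sparse categorical fields (e.g., user id, item id)."""
    def __init__(self, field_sizes, embed_dim=64):
        super().__init__()
        self.embeddings = nn.ModuleList(nn.Embedding(n, embed_dim) for n in field_sizes)
        self.linear = nn.ModuleList(nn.Embedding(n, 1) for n in field_sizes)
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, fields):
        # fields: LongTensor of shape (batch, num_fields) holding the category indices.
        v = torch.stack([emb(fields[:, i]) for i, emb in enumerate(self.embeddings)], dim=1)
        linear = sum(lin(fields[:, i]).squeeze(-1) for i, lin in enumerate(self.linear))
        # Pairwise interactions: 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), summed over dimensions.
        pairwise = 0.5 * (v.sum(dim=1) ** 2 - (v ** 2).sum(dim=1)).sum(dim=-1)
        return torch.sigmoid(self.bias + linear + pairwise)
```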

5.1.3. Experimental Protocols

We adopt mini-batch Adam to optimize all the methods, with the default learning rate set to 0.001. We fix the mini-batch size to 1024 for both datasets. For FM, the embedding size is fixed at 64. We tune the regularization coefficient over a grid of candidate values. Note that, for the DR based methods, we apply a grid search when tuning the regularization coefficients of the imputation model and the prediction model; the sample ratio of unclicked events to clicked events is also tuned. For CTR estimation, we fix the negative sampling ratio to 4.

For all the methods, we first choose the best hyper-parameters based on the validation set. Then, we apply an early stopping strategy (training stops if the model performance does not improve for five epochs) and report the test result of the model that performs best on the validation set.

We use recall and discounted cumulative gain (DCG) to evaluate the debiasing performance of all the methods. We calculate both metrics for each user in the test set and report the average score.
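A per-user sketch of the two metrics is given below (our own implementation). Note that the normalization convention for Recall@K is an assumption, since the exact definition used in the paper is not restated here; the DCG@K definition follows the standard log2 discount.

```python
import numpy as np

def rank_by_score(scores, labels):
    """Sort a user's test items by predicted score (descending) and return their labels."""
    order = np.argsort(-np.asarray(scores))
    return np.asarray(labels)[order]

def dcg_at_k(ranked_relevance, k):
    """DCG@K for one user: relevance of the top-K ranked items, discounted by log2(rank + 1)."""
    rel = np.asarray(ranked_relevance, dtype=float)[:k]
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))

def recall_at_k(ranked_relevance, k):
    """Recall@K for one user: hits in the top K over the total number of relevant items."""
    rel = np.asarray(ranked_relevance, dtype=float)
    num_relevant = rel.sum()
    return float(rel[:k].sum() / num_relevant) if num_relevant > 0 else 0.0
```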

5.2. Overall Performance (RQ2)

Table 3 shows the overall performance in terms of DCG@K and Recall@K (K ∈ {2, 4, 6}) on the two real-world datasets. To reduce the effect of randomness, we repeat the experiments 100 times for Coat Shopping and 20 times for Yahoo! R3, and report the mean and standard deviation of each metric. The best results are highlighted in boldface in the original table. From the table, we can see that the debiasing methods IPS and MRDR-DL outperform Naive on both datasets, demonstrating the necessity of handling the selection bias in CVR estimation. Meanwhile, we find that although DR-JL utilizes the unbiased DR estimator, it still gives the worst performance on both datasets. One possible explanation is that DR-JL was originally designed for debiasing explicit MNAR feedback, so its joint learning approach may not transfer directly to CVR estimation. Overall, the proposed method MRDR-DL consistently outperforms the other methods on both datasets, which verifies the effectiveness of both the proposed MRDR estimator and the double learning approach.

5.3. Ablation Study (RQ3)

Datasets        Methods          DCG@2   DCG@3   DCG@4   DCG@5   DCG@6   Recall@2  Recall@3  Recall@4  Recall@5  Recall@6
Coat Shopping   MRDR-DL          0.7219  0.8728  0.9905  1.0878  1.1695  0.8499    1.1518    1.4249    1.6765    1.9060
                DR-DL            0.7205  0.8670  0.9806  1.0778  1.1601  0.8438    1.1368    1.4004    1.6517    1.8827
                MRDR-JL          0.6948  0.8442  0.9613  1.0582  1.1442  0.8227    1.1215    1.3935    1.6439    1.8852
                MRDR-DL with SL  0.7255  0.8720  0.9871  1.0827  1.1651  0.8504    1.1434    1.4107    1.6580    1.8892
Yahoo! R3       MRDR-DL          0.5561  0.6694  0.7549  0.8234  0.8811  0.6595    0.8860    1.0846    1.2616    1.4237
                DR-DL            0.5463  0.6602  0.7459  0.8145  0.8714  0.6484    0.8762    1.0752    1.2525    1.4123
                MRDR-JL          0.5546  0.6668  0.7544  0.8221  0.8786  0.6584    0.8828    1.0862    1.2612    1.4199
                MRDR-DL with SL  0.5321  0.6439  0.7287  0.7963  0.8538  0.6298    0.8535    1.0503    1.2251    1.3863
Table 4. Ablation study of MRDR-DL on two real-world datasets.
Figure 2. Effect of the sample ratio of unclicked events to clicked events. "All" means that we sample from all the events.

To apply the DR estimator to the post-click conversion setting, the proposed method MRDR-DL incorporates several specific designs. In this subsection, we analyze their respective impacts on the method's performance via an ablation study. The experimental results for MRDR-DL and its three variants on the two datasets are summarized in Table 4. The results that are better than MRDR-DL are highlighted in boldface in the original table. We detail the variants and analyze their respective effects as follows.

(1) DR-DL: We replace the MRDR estimator with the DR estimator, i.e., we change the weights of the imputation learning from $\frac{1 - \hat{p}_{u,i}}{\hat{p}_{u,i}^2}$ back to $\frac{1}{\hat{p}_{u,i}}$. The results imply that enhancing the weights to adjust the penalty of the clicked events according to their propensities is conducive to the variance reduction of the DR estimator, and further improves the performance of the prediction model.

(2) MRDR-JL: We alternate the training of the imputation model and the prediction model without periodically sharing the parameters (i.e., we skip Step 3 of Algorithm 1). The results verify the necessity of periodically correcting the imputation model with the prediction model, which is empirically beneficial for eliminating the high variance problem of the imputation learning.

(3) MRDR-DL with SL: We replace the cross-entropy term of the imputation learning with the squared loss term that is theoretically derived from the variance of the DR estimator. The variant performs comparably to MRDR-DL on Coat Shopping but significantly worse on Yahoo! R3. One reason is that the squared loss aims at minimizing the deviation between imputed errors and true prediction errors, whereas with pseudo-labelling the imputation learning is essentially a binary classification problem. Hence, cross entropy is the more natural optimization goal.

5.4. Parameter Sensitivity Study (RQ4)

By jointly considering both clicked and unclicked events, DR based estimators can enjoy the double robustness. To investigate the impact of the unclicked events on the proposed MRDR-DL method, we vary the sample ratio of unclicked events to clicked events in the range of {0, 2, 4, 6, 8, All}. Here, "All" means that the sample ratio is set to the maximum possible value, which is 12.5 for Coat Shopping and 49.4 for Yahoo! R3. Figure 2 shows DCG@K and Recall@K for MRDR-DL with respect to different sample ratios on both datasets. As we can see, MRDR-DL with a sample ratio of 0 (i.e., sampling only from the clicked events) performs worst in most settings. This shows that, with a well-learned imputation model, the unclicked events provide the prediction model with useful information. Furthermore, we find that sampling from all the events hurts the performance of the prediction model, even though in theory more unclicked events should bring more information. One reason might be that clicked events are typically sparse in real-world datasets, so when sampling from all the events the prediction model cannot obtain sufficient information from the clicked ones. For both datasets, the optimal sample ratio is around 4 to 8; setting the sample ratio too conservatively or too aggressively may hurt the prediction performance.

5.5. Analysis of the Double Learning Approach (RQ5)

In this subsection, we further investigate the proposed double learning approach. We plot the training curves of the prediction model and the imputation model of MRDR-DL on Coat Shopping in Figure 3. In MRDR-DL, the prediction model aims at estimating the CVR, while the imputation model aims at computing the imputed errors. Hence, we evaluate their testing performance with DCG@4 and with the mean absolute error (MAE) between the imputed errors and the true prediction errors, respectively. As shown in Figure 3, the training loss of the prediction model fluctuates slightly in the first 300 epochs before gradually reaching convergence, whereas the training curve of the imputation model is more stable. The reason is that, at the very beginning, the imputation model is not yet well trained and cannot provide the prediction model with sufficiently accurate information, whereas the imputation model itself is trained on clicked events with ground-truth labels. As training proceeds, the double learning approach enables both models to exchange information periodically; in this way, both models are jointly well learned and reach convergence together after about 900 epochs. Note that the testing curve of the prediction model fluctuates during training. This is reasonable because we train it with a point-wise loss (i.e., cross entropy), whereas we evaluate its debiasing performance with a list-wise metric (i.e., DCG@4).

Figure 3. Training curves for the prediction model and the imputation model of MRDR-DL on Coat Shopping.

6. Related Work

6.1. General Approaches to CVR Estimation

CVR estimation is a key component of recommender systems because it directly contributes to the final revenue. Due to the inherent similarity, CVR estimation typically borrows the advances made in CTR prediction and implicit recommendation, such as traditional models (yd-poi; ctr-fm), deep learning based models (ctr-deepfm; ctr-din; deepctr-hfm; ww-dl), and reinforcement learning based models (lx-rl1; lx-rl2; lx-rl3). However, few studies directly investigate the CVR estimation task. Previous works often employ traditional models such as logistic regression (cvr-lr1; cvr-lr2) and gradient boosting decision trees (gbdt), while deep learning techniques such as neural networks (esmm; esm2) and graph convolutional networks (gmcm; yd-gcn) have also been adopted for CVR estimation. However, the selection bias issue remains underexplored, even though it has a significant influence on performance in practice.

6.2. Counterfactual Learning from MNAR Data

Most data for learning recommender systems are MNAR, which is caused by various biases, including selection bias, conformity bias, exposure bias, etc. (survey). Previous works typically adopt counterfactual learning methods to address these issues. Specifically, EIB methods (pmf-debias; eib) rely on a missing data model to capture the missing mechanism. IPS methods employ logistic regression (rat), the expectation-maximization algorithm (wsdm21), and matrix completion (1bitmc) to estimate the propensities for correcting the mismatch between observed and unobserved data. DR methods (dr-ali; drjl) utilize an imputation model and a prediction model to jointly learn from the MNAR data. Other methods based on the information bottleneck (cvib), meta learning (at), and causal embedding (cause; dice) have also been explored to address these biases. Among the above, IPS and DR have been widely applied to recommender systems. However, how to specify appropriate error imputation and propensity estimation is a critical issue affecting their unbiasedness, which needs to be resolved in the post-click conversion setting.

6.3. Selection Bias in CVR Estimation

Selection bias is ubiquitous in recommender systems, especially in the CVR estimation task. A few studies have investigated it and achieved effective results. ESMM (esmm) models both the CTR and CVR tasks and uses multi-task learning to alleviate the selection bias issue in a heuristic way. Similarly, (esm2), which is also essentially biased, extends ESMM by introducing additional auxiliary tasks. In contrast, GMCM (gmcm) uses the IPS estimator to derive an unbiased error evaluation when learning the CVR estimation task. In addition, Multi-IPW and Multi-DR (dr-ali) enjoy the unbiasedness of the IPS and DR estimators by learning the CTR and CVR tasks through multi-task learning. Although they consider the selection bias, the above approaches are evaluated on biased datasets; thus, their experimental results cannot verify their debiasing performance, which is a widely recognized limitation in practice. Furthermore, a recent work (dr-cvr) proposes to utilize the DR estimator for debiasing ranking metrics with post-click conversions, which mainly concerns the evaluation of recommender systems. In contrast, we focus on debiasing the learning of the CVR estimation, and we use two real-world datasets containing unbiased data to evaluate the debiasing performance.

7. Conclusion and Future Work

In this paper, we explore the problem of the selection bias in post-click conversion rate estimation. First, we analyze the bias and the variance of the DR estimator. Then, based on the theoretical analysis, we propose the more robust doubly robust estimator, which reduces the variance of the DR estimator while retaining the double robustness. Finally, we propose a novel double learning approach for the MRDR estimator. It can dynamically utilize the information of the prediction model for the imputation model and empirically eliminates the high variance problem of the imputation learning. In the experiments, we verify the effectiveness of the proposed MRDR estimator on a semi-synthetic dataset. In addition, we conduct extensive experiments on two real-world datasets to demonstrate the superiority of the proposed debiasing approach. For future work, we believe that the explainability (taert) of the debiasing approach warrants further investigation.

Acknowledgements.
This work is supported by National Natural Science Foundation of China (No.61976102, No.U19A2065 and No.61902145).

References