1 Introduction
Display advertising is a form of online advertising in which advertisers pay publishers for placing graphical ads, video ads, and so forth on their websites. The conventional way of selling display advertising was a direct long-term contract between advertisers and publishers. Over the last decade, selling display ads via a programmatic instantaneous auction called real-time bidding (RTB) has become common in performance display advertising Muthukrishnan2009AdER . Advertisers are offered several payment options, such as paying per click (CPC) and paying per conversion (CPA). CPA has become predominant because a conversion has a more direct effect on an advertiser's return on investment (ROI) than a click does. Moreover, it is less susceptible to notorious online fraud Lu2017APF . Therefore, we consider a post-click CPA model in which advertisers pay only if a user takes a predefined action on their website after clicking an advertisement. A platform that supports such performance-based payment options needs to convert advertisers' bids into the expected price per impression (eCPM). In a CPA model, the eCPM depends on the probability that a click leads to a conversion, i.e., the conversion rate (CVR). Accurately predicting the CVR is therefore essential to find the optimal price to bid for each impression, which in turn results in an efficient marketplace McAfee2011TheDO .
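As a concrete illustration of the bid conversion above, the eCPM of a post-click CPA bid discounts the advertiser's payment by the predicted click and conversion probabilities. The numbers below are hypothetical:

```python
def ecpm_from_cpa(cpa_bid: float, ctr: float, cvr: float) -> float:
    """Expected revenue per 1,000 impressions for a post-click CPA bid.

    The advertiser pays `cpa_bid` only when an impression is clicked
    (probability `ctr`) and the click then converts (probability `cvr`).
    """
    return 1000.0 * ctr * cvr * cpa_bid

# Hypothetical example: a $50 CPA bid, 1% CTR, 2% CVR.
ecpm = ecpm_from_cpa(50.0, 0.01, 0.02)  # $10 per thousand impressions
```

An overestimated CVR directly inflates this bid price, which is why accurate CVR prediction matters for marketplace efficiency.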
Although Click-Through Rate (CTR) prediction has been extensively studied and has shown promising empirical results on benchmark datasets Chapelle2014SimpleAS ; cheng2016wide ; guo2017deepfm ; Juan2016FieldawareFM , it is difficult to directly apply these methods to CVR prediction. This is because a predictive model should be trained on fresh data to prevent the data from becoming stale and to follow seasonal trends. There are two troublesome difficulties in using fresh data for CVR prediction. First, unlike a click event, a conversion does not always occur right after clicking a display advertisement. While the time delay between an impression and its click is normally a few seconds, the gap between a click and its conversion can be a few hours or even days. Consequently, some conversions that will eventually occur have not yet taken place at training time, and thus are falsely treated as negative (the positive-unlabeled problem). The second challenge is that the missing mechanism of conversion data is missing-not-at-random (MNAR). For example, conversions are much more likely to be missing if their clicks have just happened. In addition, decisive users are much more likely to convert right after their clicks than indecisive users. Thus, the probabilities of conversions being correctly observed are not uniform among samples. In the literature on covariate shift and causal inference, it is widely recognized that the MNAR mechanism can cause suboptimal and severely biased predictions (the MNAR problem) imbens2015causal ; schnabel2016recommendations ; Shimodaira2000ImprovingPI ; sugiyama2007iwcv ; Sugiyama2007DirectIE .
Several works have already addressed this delayed feedback issue. chapelle2014modeling assumes that the delay distribution is exponential and proposes two models: one predicts the conversion, and the other predicts the delay of the conversion. The two models are jointly trained via the EM algorithm or gradient descent optimization. Yoshikawa2018AND extends this study and proposes a nonparametric model for the estimation of the delay distribution. The work most closely related to ours is Ktena2019AddressingDF , which compares five different loss functions to alleviate the bias. In particular, they introduce an inverse propensity weighting estimator from causal inference and positive-unlabeled learning as two separate approaches to the delayed feedback problem. However, as discussed above, the delayed feedback setting contains both the positive-unlabeled and the MNAR problems, and a method that simultaneously solves these two challenging problems has not yet been proposed.

To address the two major challenges, we first propose an unbiased estimator for the loss function of interest that can be estimated from the observable data. The proposed estimator is based on a combination of estimation methods from causal inference and positive-unlabeled learning rubin1974estimating ; rosenbaum1983central ; imbens2015causal ; bekker2018beyond . It weights each observed sample using a parameter called the propensity score. There is, however, a difficulty: the true propensity scores are unknown in nature and thus have to be estimated. To accurately estimate the propensity score, we subsequently show that unbiased propensity estimation is possible by weighting each sample by the inverse of its conversion rate. Based on these observations, we propose an algorithm called the Dual Learning Algorithm for Delayed Feedback (DLA-DF), motivated by work in the field of unbiased learning-to-rank ai2018unbiased . The proposed algorithm jointly and alternately learns a CVR predictor and a propensity score estimator only from the observed conversion data, and it is the first method to address the positive-unlabeled and MNAR problems simultaneously in a theoretically principled way.
Furthermore, to evaluate the efficacy of the counterfactual approach in the delayed feedback setting, we conducted an experiment in which the delayed feedback situation is simulated. The experimental results demonstrate that the proposed algorithm outperforms the existing baselines in most cases and that the unbiased estimation approach is valid in the delayed feedback setting.
The rest of the paper is organized as follows. In Section 2, we formulate the delayed feedback problem. In Section 3, we describe the details of the proposed method and provide the statistical properties of the proposed estimators. The experimental setup and results are described in Section 4. We conclude the paper with a summary of contributions and future research directions in Section 5.
2 Problem Setting
In this section, we introduce our notation and formulate the delayed feedback setting.
Given a set of units indexed by i ∈ {1, …, n}, we denote by x_i ∈ X the feature vector for each unit i. Let C be a binary random variable representing the true conversion information. If an individual will convert, then C = 1; otherwise, C = 0. In the delayed feedback setting, the true conversion variables are not fully observable because of the conversion delay. To precisely formulate this delayed feedback setting, we introduce another binary random variable O. This random variable represents whether the true outcome is observed or not, which depends on the elapsed time from the click. If O = 1, then the conversion is observable; otherwise, the conversion is not observable. Using these random variables, we can represent the observed outcome indicator Y as follows:

Y = O · C.   (1)

If we have observed the conversion of a unit, then Y = 1; otherwise, Y = 0. Note that the true conversion indicator C is not always equal to the observed conversion indicator Y, and a conversion is observed only when the unit will convert and the true outcome is observable (i.e., C = O = 1).
Throughout the paper, we make the following assumption:

C ⊥ O | x.   (2)
This assumption is called Unconfoundedness in the context of causal inference imbens2015causal ; rosenbaum1983central ; rubin1974estimating , and it means that the features affecting both C and O are fully observed. From this assumption, we obtain the following equation relating the true conversion rate and the observed conversion rate:
P(Y = 1 | x) = θ(x) · γ(x),   (3)

where we denote P(C = 1 | x) as γ(x) and P(O = 1 | x) as θ(x), respectively.
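A quick simulation illustrates the factorization in Eq. (3): when C and O are drawn independently given x, the observed conversion rate is the product of the true conversion rate and the observation probability. The probabilities below are arbitrary illustrative values:

```python
import random

random.seed(0)
gamma, theta = 0.3, 0.6  # true CVR and observation probability for a fixed x
n = 200_000

# Y = C * O, with C and O drawn independently (unconfoundedness).
observed = sum(
    (random.random() < gamma) and (random.random() < theta) for _ in range(n)
)
empirical_rate = observed / n  # should be close to gamma * theta = 0.18
```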
The goal of this paper is to obtain a hypothesis f̂ : X → (0, 1) that accurately predicts the true conversion rate γ(x). To achieve this goal, we define the ideal loss function that should be optimized to obtain a well-performing hypothesis as follows.
Definition 1.
(Ideal loss function for conversion rate prediction) The ideal loss function to be optimized to obtain a hypothesis f̂ is defined as follows:

L_ideal(f̂) = (1/n) Σ_{i=1}^n [ C_i · ℓ₁(f̂(x_i)) + (1 − C_i) · ℓ₀(f̂(x_i)) ],   (4)

where the functions ℓ₁(·) and ℓ₀(·) characterize the loss function. For example, when these functions are defined as ℓ₁(p) = −log(p) and ℓ₀(p) = −log(1 − p), then Eq. (4) is called the binary cross-entropy loss.
The ideal loss function in Eq. (4) is defined using the true conversion indicators C_i. In the standard supervised machine learning setting, the model parameters of f̂ are obtained by the empirical risk minimization framework as follows Mohri:2012:FML:2371238 :

f̂ = argmin_{f̂ ∈ H} L_ideal(f̂),

where H is a space of real-valued functions called the hypothesis space. Here, L_ideal(·) is the empirical average over the conversion events and is an unbiased estimator of the expected loss. However, in the delayed feedback setting, the true conversion indicators are unobserved, and thus, empirical risk minimization is infeasible in nature. Therefore, the critical component of the delayed feedback problem is to estimate the ideal loss function using only observable variables.
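With the cross-entropy choice ℓ₁(p) = −log(p) and ℓ₀(p) = −log(1 − p), the ideal loss of Eq. (4) can be sketched as follows; the labels and predictions here are placeholder values:

```python
import math

def ell1(p): return -math.log(p)       # loss for a positive (C = 1)
def ell0(p): return -math.log(1 - p)   # loss for a negative (C = 0)

def ideal_loss(cs, preds):
    """Eq. (4): cross-entropy against the *true* conversion indicators,
    which are unobservable in the delayed feedback setting."""
    return sum(c * ell1(p) + (1 - c) * ell0(p) for c, p in zip(cs, preds)) / len(cs)

loss = ideal_loss([1, 0, 1], [0.9, 0.2, 0.8])
```

The point of the remainder of the section is to estimate this quantity without ever seeing the `cs` list.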
3 Proposed Method
In this section, we propose a dual learning framework inspired by the learning procedure of unbiased learning-to-rank ai2018unbiased . The proposed framework treats the CVR prediction and propensity score estimation problems as duals of each other and jointly learns the two predictors only from the observed conversions.
3.1 Unbiased Conversion Rate Prediction
First, we formally define the propensity score for the delayed feedback setting as follows.
Definition 2.
(Propensity score) The propensity score for the delayed feedback setting is defined as

θ(x) := P(O = 1 | x).   (5)
The propensity score in the delayed feedback setting is interpreted as the probability of each conversion being correctly observed. In the context of causal inference, an unbiased estimator for the causal effect of a treatment can be derived by weighting each sample by the inverse of its propensity score rubin1974estimating ; rosenbaum1983central ; imbens2015causal . However, in the delayed feedback setting, the observation indicators O are unobserved, and the inverse propensity score (IPS) estimation technique cannot be applied directly. Thus, we combine the IPS estimator with an estimation technique from the field of positive-unlabeled learning Elkan2008LearningCF ; bekker2018beyond and define the IPS estimator for the delayed feedback setting as follows.
Definition 3.
(IPS estimator for the delayed feedback setting) Given the propensity scores, the IPS estimator for the ideal loss function in Eq. (4) is defined as

L_IPS(f̂) = (1/n) Σ_{i=1}^n [ (Y_i / θ(x_i)) · ℓ₁(f̂(x_i)) + (1 − Y_i / θ(x_i)) · ℓ₀(f̂(x_i)) ].   (6)
The following proposition formally proves that the IPS estimator is unbiased against the ideal loss function.
Proposition 1.
(Unbiasedness of the IPS estimator) The IPS estimator in Eq. (6) is unbiased against the ideal loss function in Eq. (4).
Proof.
Taking the expectation of the IPS estimator in Eq. (6) over the observation indicators, we have

E[L_IPS(f̂)] = (1/n) Σ_{i=1}^n [ (E[Y_i] / θ(x_i)) ℓ₁(f̂(x_i)) + (1 − E[Y_i] / θ(x_i)) ℓ₀(f̂(x_i)) ].

By Eq. (1) and the unconfoundedness assumption in Eq. (2), E[Y_i] = E[O_i] · C_i = θ(x_i) · C_i. Substituting this yields

E[L_IPS(f̂)] = (1/n) Σ_{i=1}^n [ C_i ℓ₁(f̂(x_i)) + (1 − C_i) ℓ₀(f̂(x_i)) ] = L_ideal(f̂).
∎
Proposition 1 validates that the proposed IPS estimator is unbiased against the ideal loss, and thus, the unbiased training of the true conversion rate predictor is feasible in the delayed feedback setting.
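The unbiasedness in Proposition 1 can be checked numerically. The sketch below, using arbitrary synthetic labels and propensities, averages the IPS loss of Eq. (6) over many draws of the observation indicators and compares it with the ideal loss of Eq. (4) computed from the true labels:

```python
import math
import random

random.seed(1)

def ell1(p): return -math.log(p)
def ell0(p): return -math.log(1 - p)

# A tiny population: true labels C_i, propensities theta_i, predictions f_i.
cs     = [1, 1, 0, 1, 0]
thetas = [0.5, 0.8, 0.6, 0.3, 0.9]
preds  = [0.7, 0.6, 0.3, 0.8, 0.2]

def ips_loss(ys):
    """Eq. (6): weight each observed conversion Y_i by 1 / theta_i."""
    return sum(
        (y / t) * ell1(p) + (1 - y / t) * ell0(p)
        for y, t, p in zip(ys, thetas, preds)
    ) / len(ys)

# Ideal loss of Eq. (4), computable here only because cs is known.
ideal = sum(c * ell1(p) + (1 - c) * ell0(p) for c, p in zip(cs, preds)) / len(cs)

# Average the IPS loss over repeated draws of O_i (Y_i = O_i * C_i).
trials = 20_000
avg = sum(
    ips_loss([c * (random.random() < t) for c, t in zip(cs, thetas)])
    for _ in range(trials)
) / trials
# avg should be close to `ideal`, illustrating E[L_IPS] = L_ideal.
```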
3.2 Unbiased Propensity Estimation
The unbiasedness stated in Proposition 1 is desirable for obtaining a conversion rate predictor, but it depends on the availability of the true propensity scores. In general, the estimation of the propensity score in the IPS estimator can be formulated as a classification problem westreich2010propensity ; lee2010improving ; however, the indicator variables O are never observable in the delayed feedback setting. Therefore, we propose a method to unbiasedly estimate the propensity score from the observed conversion indicators.
We first define the ideal loss function for the propensity estimation as follows.
Definition 4.
(Ideal loss function for the propensity estimation) The ideal loss function for the propensity estimation is defined as follows:

L_PROP(θ̂) = (1/n) Σ_{i=1}^n [ O_i · ℓ₁(θ̂(x_i)) + (1 − O_i) · ℓ₀(θ̂(x_i)) ],   (7)

where θ̂ is a hypothesis that estimates the propensity score.
Then, we define the Inverse Conversion Rate (ICVR) estimator, which has the same structure as the IPS estimator, as follows.
Definition 5.
(ICVR estimator) Given the conversion rates, the ICVR estimator for the ideal loss function in Eq. (7) is defined as

L_ICVR(θ̂) = (1/n) Σ_{i=1}^n [ (Y_i / γ(x_i)) · ℓ₁(θ̂(x_i)) + (1 − Y_i / γ(x_i)) · ℓ₀(θ̂(x_i)) ].   (8)
Following the same logic as in Proposition 1, the next proposition proves that the ICVR estimator is unbiased against the ideal loss function for the propensity estimation.
Proposition 2.
(Unbiasedness of ICVR estimator) The ICVR estimator in Eq. (8) is unbiased against the ideal loss function in Eq. (7).
Proof.
Taking the expectation of the ICVR estimator in Eq. (8) over the conversion indicators, and using E[Y_i] = E[C_i] · O_i = γ(x_i) · O_i (by Eq. (1) and the unconfoundedness assumption in Eq. (2)), we have

E[L_ICVR(θ̂)] = (1/n) Σ_{i=1}^n [ O_i ℓ₁(θ̂(x_i)) + (1 − O_i) ℓ₀(θ̂(x_i)) ] = L_PROP(θ̂).
∎
Proposition 2 indicates that the proposed ICVR estimator is unbiased against the ideal loss for the propensity estimation. Thus, a better conversion rate predictor leads to a better-performing propensity estimator, and vice versa. Based on these unbiased estimators, we propose the Dual Learning Algorithm for Delayed Feedback, which jointly trains the propensity estimator and the conversion rate predictor with the observed conversion data.
3.3 Dual Learning Algorithm for Delayed Feedback
Here, we state the proposed DLA-DF algorithm. The basic idea behind the algorithm is to estimate the propensity score and the true conversion rate simultaneously using the corresponding unbiased estimators.
First, given a propensity estimator θ̂_ψ parameterized by ψ, the loss function used to derive the parameter φ of a conversion rate predictor f̂_φ is given below:

L(φ | ψ) = (1/n) Σ_{i=1}^n [ (Y_i / θ̂_ψ(x_i)) · ℓ₁(f̂_φ(x_i)) + (1 − Y_i / θ̂_ψ(x_i)) · ℓ₀(f̂_φ(x_i)) ].   (9)
Next, given a conversion rate predictor f̂_φ parameterized by φ, the loss function used to derive the parameter ψ of a propensity estimator θ̂_ψ is given below:

L(ψ | φ) = (1/n) Σ_{i=1}^n [ (Y_i / f̂_φ(x_i)) · ℓ₁(θ̂_ψ(x_i)) + (1 − Y_i / f̂_φ(x_i)) · ℓ₀(θ̂_ψ(x_i)) ].   (10)
The detailed procedure of the proposed DLA-DF is described in Algorithm 1.
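The alternating procedure can be sketched as follows. This is a minimal illustration with a one-feature logistic model, full-batch gradient steps, and arbitrary synthetic probabilities; the learning rate, iteration count, and the small floor on predicted probabilities (to avoid division blow-ups) are illustrative choices, not part of the algorithm's specification:

```python
import math
import random

random.seed(2)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny synthetic data: one feature x; C and O drawn independently given x.
n = 500
xs = [random.gauss(0, 1) for _ in range(n)]
cs = [int(random.random() < sigmoid(1.5 * x)) for x in xs]         # true conversions C
os_ = [int(random.random() < sigmoid(0.5 * x + 1.0)) for x in xs]  # observations O
ys = [c * o for c, o in zip(cs, os_)]                              # Y = O * C

def predict(w, b, x):
    return sigmoid(w * x + b)

def grad_step(w, b, targets, lr=0.1):
    """One full-batch gradient step. For a sigmoid model, the gradient of
    q * ell1(p) + (1 - q) * ell0(p) w.r.t. the logit is (p - q), even when
    the soft target q = Y / weight lies outside {0, 1}."""
    gw = gb = 0.0
    for x, q in zip(xs, targets):
        err = predict(w, b, x) - q
        gw += err * x
        gb += err
    return w - lr * gw / n, b - lr * gb / n

w_f = b_f = 0.0  # CVR predictor f
w_t = b_t = 0.0  # propensity estimator theta

for _ in range(600):
    # Step 1: update f on the IPS loss (soft target Y_i / theta_hat(x_i)).
    theta_hat = [max(predict(w_t, b_t, x), 0.1) for x in xs]
    w_f, b_f = grad_step(w_f, b_f, [y / t for y, t in zip(ys, theta_hat)])
    # Step 2: update theta on the ICVR loss (soft target Y_i / f_hat(x_i)).
    f_hat = [max(predict(w_f, b_f, x), 0.1) for x in xs]
    w_t, b_t = grad_step(w_t, b_t, [y / f for y, f in zip(ys, f_hat)])
```

Since the data were generated with positive slopes for both the conversion and observation probabilities, both learned slopes should come out positive.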
3.4 Statistical Properties
In this subsection, we state some statistical properties of the proposed estimators. Note that the formal proofs can be found in the supplementary material.
Theorem 1.
(Variance) Given the sets of independent random variables {C_i}_{i=1}^n and {O_i}_{i=1}^n, the propensity scores θ(·), and a conversion rate predictor f̂, the variance of the IPS estimator is

Var(L_IPS(f̂)) = (1/n²) Σ_{i=1}^n ( γ(x_i)(1 − θ(x_i)γ(x_i)) / θ(x_i) ) · ( ℓ₁(f̂(x_i)) − ℓ₀(f̂(x_i)) )².

Replacing γ(·), θ(·), and f̂ with θ(·), γ(·), and θ̂, respectively, provides the variance of the ICVR estimator.
Proposition 3.
(Estimation error tail bound) Under the same assumptions as in Theorem 1, for any δ ∈ (0, 1), the following inequality holds with a probability of at least 1 − δ:

| L_IPS(f̂) − L_ideal(f̂) | ≤ sqrt( ( log(2/δ) / (2n²) ) Σ_{i=1}^n ( ℓ₁(f̂(x_i)) − ℓ₀(f̂(x_i)) )² / θ(x_i)² ).

Replacing L_ideal, θ(·), and f̂ with L_PROP, γ(·), and θ̂, respectively, provides the estimation error tail bound of the ICVR estimator.
The RHS of both the variance and the estimation error tail bound depends on the inverse of the propensity scores. Thus, these bounds can be loose, especially when there exists a severe delay. Moreover, the analysis implies that applying a variance reduction technique to the unbiased estimators might improve their statistical properties by reducing the variance at the cost of introducing some bias. Based on these implications, we utilize the following nonnegative estimator inspired by the work in positive-unlabeled learning kiryo2017positive :
Definition 6.
(Nonnegative estimator) Given the propensity scores and a constant β ≥ 0, the nonnegative estimator is defined as

L_nn(f̂) = (1/n) Σ_{i=1}^n [ (Y_i / θ(x_i)) · ℓ₁(f̂(x_i)) + max( β, 1 − Y_i / θ(x_i) ) · ℓ₀(f̂(x_i)) ].   (11)
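A minimal sketch of the per-sample loss coefficients in this nonnegative estimator follows; the clipping form is our reading of the definition, and β and the toy values are illustrative:

```python
def nn_loss_terms(y, theta, beta=0.1):
    """Per-sample coefficients of ell1 and ell0 in the nonnegative
    estimator: the coefficient (1 - y/theta), which can be negative when
    y = 1, is floored at beta. This bounds the loss from below, reducing
    variance at the cost of some bias."""
    w1 = y / theta
    w0 = max(beta, 1.0 - y / theta)
    return w1, w0

w1, w0 = nn_loss_terms(1, 0.4)  # w1 = 2.5; w0 floored at beta = 0.1
```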
A larger value of β reduces the variance of the estimator at the cost of introducing some bias. We explore the effect of different values of the hyperparameter β on the performance of the conversion rate predictor in Section 4.

4 Experimental Results
In this section, we provide an empirical comparison of the proposed method and the baseline methods using a synthetic dataset.
4.1 Experimental Setup
Synthetic data generation procedure: We created a synthetic dataset simulating the delayed feedback setting in display advertising to evaluate the performance of the methods for the delayed feedback problem. The components of the synthetic data are as follows:

The number of click events n and the number of features d observed for each event, both fixed across the experiments.

The distribution of the feature vectors x. We drew odd-numbered features independently from a Gaussian distribution with a fixed standard deviation. In contrast, we drew even-numbered features independently from a Bernoulli distribution with a fixed parameter. 
The training period T, which is varied depending on the experimental condition.

The timestamps of clicks ts_click, sampled from a uniform distribution.

The delay D between the click and the conversion. Following the probabilistic model used in chapelle2014modeling , we assumed that the distribution of the delay is exponential.

Coefficient vectors generating the true conversion rate and the parameter of the exponential delay distribution. Both coefficient vectors were sampled from a Gaussian distribution with a fixed standard deviation.
To summarize, our data generation model is

C ~ Bernoulli(γ(x)),   D ~ Exponential(λ(x)),   ts_cv = ts_click + D,

where ts_cv represents the unit's timestamp of conversion, γ(x) is the true conversion rate, and λ(x) is the rate parameter of the exponential delay distribution. If this timestamp exceeds the end of the training period T, the conversion is not correctly observed (i.e., O = 0, and thus Y = 0 even when C = 1).
In the experiments, we used 25 randomly sampled realizations.
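The generation procedure above can be sketched as follows. The dataset size, feature dimension, the unit-scale coefficient distributions, the Bernoulli parameter of 0.5, and the feature-index parity are all illustrative assumptions, since the exact values are not reproduced here:

```python
import math
import random

random.seed(3)

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

def generate(n=1000, d=10, training_period=30.0):
    """One realization of the synthetic delayed-feedback dataset."""
    w_cvr = [random.gauss(0, 1) for _ in range(d)]    # CVR coefficients
    w_delay = [random.gauss(0, 1) for _ in range(d)]  # delay-rate coefficients
    data = []
    for _ in range(n):
        # Alternate Gaussian and Bernoulli features.
        x = [random.gauss(0, 1) if j % 2 == 1 else float(random.random() < 0.5)
             for j in range(d)]
        ts_click = random.uniform(0, training_period)
        cvr = sigmoid(sum(wj * xj for wj, xj in zip(w_cvr, x)))
        c = int(random.random() < cvr)                # C ~ Bernoulli(gamma(x))
        # Exponential delay between click and conversion, rate lambda(x).
        rate = math.exp(sum(wj * xj for wj, xj in zip(w_delay, x)))
        delay = random.expovariate(rate)
        # Conversion observed only if it lands inside the training period.
        o = int(ts_click + delay <= training_period)
        data.append((x, c, c * o))                    # (features, C, observed Y)
    return data

data = generate()
```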
Baselines and the proposed method: We compared the performance of the following methods.

Oracle: The logistic regression model trained with the true conversion data, which are unobservable in reality. The performance of the oracle model is the best achievable prediction performance.

Naive: The logistic regression model naively trained with the observed conversion data.

nnPU: The logistic regression model trained with the positive-unlabeled loss function proposed in ktena2019addressing . We applied the nonnegative variance reduction technique to this loss function to ensure a fair comparison.

Delayed Feedback Model (DFM): This model was proposed in chapelle2014modeling , and we implemented it in the TensorFlow environment. We obtained the model parameters of DFM following the joint optimization procedure described in Section 4.2 of chapelle2014modeling . 
nnDLA-DF: This is our proposed method. We used the logistic regression model for both the conversion rate predictor (f̂) and the propensity estimator (θ̂). Both estimators were trained with the nonnegative loss function defined in Eq. (11).
4.2 Results
Here we report the results of the experiment.
First, Figure 1(b) shows the log loss on the test sets relative to the performance of the oracle model for several values of the training period T (in days). A smaller value of T induces smaller propensities, as shown in Figure 1(a), and this leads to a larger bias in the observed data. The results show that the proposed method significantly outperformed the other methods when T was small. In contrast, the benefit of DLA-DF was slight when T was large; however, it was not largely outperformed by the other methods in any of the settings, which suggests the stable prediction performance of the proposed algorithm.
Figure 1(c) shows the effect of the hyperparameter β in Eq. (11), which controls the bias-variance trade-off of the estimators of the ideal loss functions. A larger β reduces the variance of the estimators at the cost of introducing some bias; in contrast, a smaller β introduces little bias, but the variance can be huge. The figure shows the performance of the DLA-DF algorithm with different values of β. It was observed that the log loss can be improved by approximately 2.5% by appropriately controlling the bias-variance trade-off.
5 Conclusion
In this study, we explored the delayed feedback problem, in which the true conversion indicators are not fully observable due to the conversion delay. To address the problem, we developed a framework called the Dual Learning Algorithm for Delayed Feedback that trains a conversion rate predictor and a propensity estimator alternately. In the empirical evaluation using synthetic data, the proposed algorithm generally outperformed the existing baseline methods with respect to the log loss. The results also suggest the benefit of the unbiased estimation approach in the delayed feedback setting.
Important future research directions are the theoretical analysis of the variance reduction technique and the empirical comparison using realworld datasets.
Acknowledgments. The work was done during YS’s internship at CyberAgent, Inc., AI Lab. The authors would like to thank Daisuke Moriwaki and Kazuki Taniguchi for their helpful comments and discussions.
References
 [1] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. Unbiased learning to rank with unbiased propensity estimation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 385–394. ACM, 2018.
 [2] Jessa Bekker and Jesse Davis. Beyond the selected completely at random assumption for learning from positive and unlabeled data. arXiv preprint arXiv:1809.03207, 2018.
 [3] Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1097–1105. ACM, 2014.
 [4] Olivier Chapelle, Eren Manavoglu, and Rómer Rosales. Simple and scalable response prediction for display advertising. ACM TIST, 5:61:1–61:34, 2014.

 [5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016.
 [6] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In KDD, 2008.
 [7] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: A factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1725–1731. AAAI Press, 2017.
 [8] Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
 [9] Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. Field-aware factorization machines for CTR prediction. In RecSys, 2016.
 [10] Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, pages 1675–1685, 2017.
 [11] Sofia Ira Ktena, Alykhan Tejani, Lucas Theis, Pranay Kumar Myana, Deepak Dilipkumar, Ferenc Huszár, Steven Yoo, and Wenzhe Shi. Addressing delayed feedback for continuous training with neural networks in CTR prediction. arXiv preprint arXiv:1907.06558, 2019.
 [12] Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Improving propensity score weighting using machine learning. Statistics in Medicine, 29(3):337–346, 2010.
 [13] Quan Lu, Shengjun Pan, Liang Wang, Junwei Pan, Fengdan Wan, and Hongxia Yang. A practical framework of conversion rate prediction for online display advertising. In ADKDD@KDD, 2017.
 [14] R. Preston McAfee. The design of advertising exchanges. Review of Industrial Organization, 39:169–185, 2011.
 [15] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
 [16] S. Muthukrishnan. Ad exchanges: Research issues. In WINE, 2009.
 [17] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
 [18] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
 [19] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In International Conference on Machine Learning, pages 1670–1679, 2016.
 [20] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
 [21] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.
 [22] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2007.
 [23] Daniel Westreich, Justin Lessler, and Michele Jonsson Funk. Propensity score estimation: machine learning and classification methods as alternatives to logistic regression. Journal of Clinical Epidemiology, 63(8):826, 2010.
 [24] Yuya Yoshikawa and Yusaku Imai. A nonparametric delayed feedback model for conversion rate prediction. arXiv preprint arXiv:1802.00255, 2018.
Appendix A Omitted Proofs
A.1 Proof of Theorem 1
Proof.
First, we define

Z_i := (Y_i / θ(x_i)) ℓ₁(f̂(x_i)) + (1 − Y_i / θ(x_i)) ℓ₀(f̂(x_i)).   (12)

Subsequently, the IPS estimator can be written as L_IPS(f̂) = (1/n) Σ_{i=1}^n Z_i. Rearranging the terms gives

Z_i = ℓ₀(f̂(x_i)) + (Y_i / θ(x_i)) ( ℓ₁(f̂(x_i)) − ℓ₀(f̂(x_i)) ).

By Eqs. (1)–(3), Y_i is a Bernoulli random variable with mean θ(x_i)γ(x_i), so Var(Y_i) = θ(x_i)γ(x_i)(1 − θ(x_i)γ(x_i)). Then,

Var(Z_i) = ( Var(Y_i) / θ(x_i)² ) ( ℓ₁(f̂(x_i)) − ℓ₀(f̂(x_i)) )² = ( γ(x_i)(1 − θ(x_i)γ(x_i)) / θ(x_i) ) ( ℓ₁(f̂(x_i)) − ℓ₀(f̂(x_i)) )².

From the assumption, {Z_i}_{i=1}^n is a set of independent random variables. Thus,

Var(L_IPS(f̂)) = (1/n²) Σ_{i=1}^n Var(Z_i) = (1/n²) Σ_{i=1}^n ( γ(x_i)(1 − θ(x_i)γ(x_i)) / θ(x_i) ) ( ℓ₁(f̂(x_i)) − ℓ₀(f̂(x_i)) )².
∎
A.2 Hoeffding's Inequality
Lemma 1.
(Hoeffding's inequality) Let Z_1, …, Z_n be independent bounded random variables such that each Z_i takes values in an interval of size c_i. Then, for any t > 0,

P( | Σ_{i=1}^n ( Z_i − E[Z_i] ) | ≥ t ) ≤ 2 exp( − 2t² / Σ_{i=1}^n c_i² ).   (13)
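As a quick numeric illustration of Lemma 1, the sketch below (with arbitrary bounded variables) compares the empirical deviation probability of a sum of uniform random variables against the bound of Eq. (13):

```python
import math
import random

random.seed(4)

def hoeffding_bound(t, sizes):
    """Eq. (13): P(|S - E[S]| >= t) <= 2 * exp(-2 t^2 / sum(c_i^2))."""
    return 2.0 * math.exp(-2.0 * t * t / sum(c * c for c in sizes))

n, t, trials = 100, 10.0, 5000
# 100 i.i.d. Uniform[0, 1] variables: interval sizes c_i = 1, E[S] = 50.
deviations = sum(
    abs(sum(random.random() for _ in range(n)) - n / 2) >= t
    for _ in range(trials)
)
empirical = deviations / trials
bound = hoeffding_bound(t, [1.0] * n)  # 2 * exp(-2), about 0.27
# `empirical` never exceeds `bound`.
```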