1 Introduction
For decades, the advertising industry has used reach, or the number of users who viewed an advertisement, to estimate the effectiveness of ads and to price advertising opportunities. This covered all forms of advertising, including advertising on billboards, in newspapers and on television. Online advertising also started out following this model, charging advertisers for the number of times their ad was viewed, and even now such arrangements remain popular, especially for brand advertising.
However, the growing sophistication of online advertising platforms increasingly allows advertisers to automatically target deeper goals as well. In the 1990s, we saw the emergence of cost-per-click advertising and finer-grained targeting, so advertisers would only be charged if a user clicked on their ad, and were able to manually assign distinct values to different segments of users. Combined with real-time bidding (RTB) for deciding which ad to show, this necessitated the creation of click-through rate prediction models to compute the probability of a click on a particular impression for a particular advertiser.
In recent years, we have seen two new, synergistic emerging trends. First, increasing automation in targeting means that while advertisers would previously need to specify the exact keywords or sets of users they wish to see their ads, now advertising platforms are increasingly able to automatically find impressions that would contribute to satisfying the advertisers’ objectives, with significantly reduced manual targeting configuration. At the same time, platforms are increasingly allowing advertisers to provide more details about their objectives by specifying what events (for example, purchases) they value after a user has clicked.
One simple advertising campaign setup for achieving this is for the advertiser to specify the events they care about, and for the campaign objective to be to maximize the number of events performed by users who click on advertisements, under the constraints of maximum total spend and average cost per event. A more sophisticated form of this configuration is to additionally specify a value for each event that occurs, and for the objective to be to maximize the overall value delivered by the advertising campaign. On the advertising platform side, optimizing for these objectives then requires predicting the number or value of post-click events (commonly called “conversions”), in addition to predicting the click-through rate as before.
Advertisers can set up once-per-click (OPC) or many-per-click (MPC) types of conversions. In the OPC case, at most one conversion following the click will be counted, while in the MPC case, a click can have any number of conversions as long as they occur within the post-click window. The duration of the post-click attribution window is set by the advertiser and can be as short as 2 hours or as long as 90 days.
For example, a ride-sharing app may have a campaign that optimizes only for users who actually go on to take a first ride (OPC). On the other hand, a puzzle game could report a conversion whenever a user reaches level 20 in the game (OPC) or makes an in-app purchase (MPC).
Building a conversion optimizer model to estimate the expected number of conversions for MPC or OPC/MPC mixed campaigns raises several challenges that are absent in estimating the click-through rate of pure OPC conversions.
1. The events that are being predicted can take up to 90 days to be reported.
2. Unlike clicks or OPC conversions, there can be (and often are) multiple events associated with each click or impression. This also increases the difficulty of the previous challenge, since we cannot be sure that we have seen all events attributable to the click until the end of the attribution window.
3. Since events can be defined by advertisers in arbitrary ways, their distributions are highly heterogeneous, both in terms of the rate and delay of events.
4. The environment is nonstationary, with user behavior changing in response to changes on advertiser apps or websites, advertisers sometimes changing their optimization objectives, and new advertising campaigns with new objectives being created. In particular, for new campaigns, very little generalization is possible, since we do not know a priori how the events will be defined.
To address this last concern of nonstationarity, it is common practice to use online training for models, training them on data that is as close to real time as possible. In this context, however, the problem of conversion delay becomes especially acute: when training close to real time, the model misses any conversions that would arrive after its training delay, and so sees systematically fewer conversions than will ultimately happen. Although the conversion delay problem is also a challenge for batch training (as some portion of examples will still be immature), we focus on the online training setting in this paper. Our results trivially generalize to batch training.
In this work, we develop a model for the expected number of conversions that is able to train close to real time while achieving neutral long-term bias of predictions, largely eliminating the tradeoff between the accuracy gained through short training delay and the bias in longer-delay examples. To do so, we model the expected distribution of conversion delays in a nonparametric way, while taking advantage of any available intermediate information to improve the accuracy of predictions.
2 Related work on delay modeling
Delay modeling is closely related to the problem of censored feedback. There is an extensive literature on this topic; related work can be found in Elkan and Noto (2008) and its references. While this literature is quite relevant, it is not directly applicable. On the other hand, there have also been multiple papers studying conversion optimization, such as Agarwal et al. (2010); Menon et al. (2011); Lee et al. (2012); Rosales et al. (2012), but these do not address the problem of delayed feedback in labels.
The first formal study of handling delayed feedback in conversion optimizers was initiated by Chapelle (2014). The results in that paper, as well as in all following papers on this problem, have been restricted to the special case of at most one conversion per click. The solution proposed in Chapelle (2014) uses many of the tools of survival analysis. At a high level, the author trains a model to estimate the conversion delay (time from click to conversion), assuming that it follows an exponential distribution. The probability of observing a conversion given the click age is then computed as a function of the probability of conversion and the delay distribution using Bayes’ rule. Finally, a model for the probability of observing a conversion given the click age is trained on the realized label.
There have been several follow-up papers to Chapelle (2014) in the same restricted setting. Hubbard et al. (2019) improves the result by assuming that the delay distribution is a geometric distribution with a beta prior, Ktena et al. (2019) studies the problem with different types of loss functions such as inverse propensity scoring and importance sampling, Yoshikawa and Imai (2018) considers a nonparametric family of delay distributions which is a weighted sum of kernel values, Mann et al. (2019) studies it in the adversarial online-learning framework, Ji et al. (2017) studies it with the Weibull family of distributions, Safari et al. (2017) gives a biased estimator for improving efficiency, and Vernade et al. (2017) studies it in the bandit setting. These solutions are not able to handle the more general case of multiple conversions per click.
There are also recent works handling this problem without the assumption of a parametric delay distribution. Saito et al. (2020) proposed a dual learning algorithm of the CVR predictor and bias estimator. Yasui et al. (2020) addressed this problem by using an importance weighting approach typically used for covariate shift correction. Su et al. (2020) calibrated the delay model by learning a dynamic hazard function with the post-click data, and Kato and Yasui (2020) proposed a method with an unbiased and convex empirical risk constructed from samples. These works also do not extend to the MPC case.
The first paper to study handling multiple conversions per click was Choi et al. (2020). They extend Chapelle (2014) by assuming that conversions follow a negative binomial distribution with exponentially distributed time delays between them. There are two challenges with this approach. The first is that it works only for integer conversions and cannot easily extend to predicting the expected value, where the label is a float. The second, as noted in the paper, is that the loss function they define is nonconvex, and the model can predict either that the data has a high conversion rate and long delay or a low conversion rate and short delay. Empirically, this works out fine in batch training, as the model sees some mature data in each batch to resolve this issue. Unfortunately, in online learning this is not the case, as the model stops seeing mature data when training on more recent examples. This makes nonparametric methods like the one proposed in this paper more robust in these settings.
3 Preliminaries
We can more formally describe the problem as follows:
Our model trains on examples within a fixed time range in sequence, visiting each example once. Each example is associated with two types of features. Features of the first type are available at the time the example would have been predicted on in serving. Features of the second type are delayed and may change between the time at which we would have been required to make a prediction on this example and a fixed maximum delay after that time, after which we disregard any further updates to the example. The label can be considered a subtype of the delayed features. For convenience, we will say that the label is the total number or value of individual events that become visible within this window, although our approach could also be used for fully continuous response variables.
Finally, at prediction time, we see an example with only its serving-time features, and must make a prediction for it using the current model parameters. We desire this prediction to be accurate and unbiased under the assumption that the time distribution of the label and delayed features conditional on the input features is relatively stationary.
4 Model
4.1 Core ideas
The fundamental goal of our approach is to model the expected delay distribution for each particular example. From this submodel, we want to receive a prediction, such that when we see an incomplete label (i.e. one which may still be updated upwards over time), we can predict what the label will be when it is complete, and then use this completed label for training the main model.
To model the delay distribution, we introduce a novel model configuration, based on three core ideas.
Split the label into different delay buckets
Say we have a training example and want to train on it before its attribution window has elapsed. Then we know the portion of the label that occurred between the example's time and the training time, and, in order to avoid bias in training, need to predict the “remaining” portion of the label for the period between the training time and the end of the window.
To make this prediction, there are two reasonable approaches that could be taken: either we can attempt to build a model that can make this prediction for arbitrary values of the training delay, or we can choose a number of fixed delays at which we are able to make predictions. The first approach requires either a parametric distribution assumption, which is undesirable due to the heterogeneity of our delay distribution, or the use of a detailed time series model. So, we elect to use the second approach to avoid unnecessary complexity.
To do so, we interpret the overall label as a sum of conversions that fall in different delay buckets, delimited by an increasing sequence of fixed delays. Then we can create a set of submodels, one to predict the label in each bucket, and train each submodel only on examples whose age is at least the end of its bucket (i.e. examples for which the label of the bucket is complete).
Then, if we wish to estimate the complete label for an example at a given training time, we can find the latest submodel whose prediction interval begins no later than the example's current age, and compute the total expected label as the sum of the truncated known label part (up to the start of that interval) and the predicted label part. It is easy to see that this completed label is an unbiased estimator of the full label if and only if each submodel is an unbiased estimator of the label part on its own interval.
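The completion step described above can be sketched as follows. This is a minimal illustration and not the exact production logic: the bucket boundaries `BOUNDS`, the function name `complete_label`, and the representation of events as a list of delays are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical delay boundaries in days: bucket i covers [BOUNDS[i], BOUNDS[i + 1]).
BOUNDS = [0.0, 1.0, 7.0, 30.0]


def complete_label(event_delays, age, bucket_predictions):
    """Estimate the full label of an example that is `age` days old.

    event_delays: delays (days since click) of the events observed so far.
    bucket_predictions: one predicted label per bucket, from each submodel.
    """
    if age >= BOUNDS[-1]:
        # The example is mature: the observed label is already complete.
        return float(len(event_delays))
    # Latest bucket whose start is at or before the current age.
    i = bisect_right(BOUNDS, age) - 1
    cutoff = BOUNDS[i]
    # Known part: events before the start of that bucket. Events observed
    # inside the partially elapsed bucket are discarded in favor of the prediction.
    known = sum(1 for d in event_delays if d < cutoff)
    # Predicted part: this bucket and every later one.
    predicted = sum(bucket_predictions[i:])
    return known + predicted
```

For an example that is 4 days old with events at delays 0.5 and 3.0, only the first event is kept as known; the second falls in the partially elapsed bucket and is replaced by the submodel predictions.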
Thermometer encoding of the labels
Observe that in the above setup, any observed events that fall within the time bucket of the submodel we need to use are discarded when computing the completed label, because the submodel's prediction is substituted for them. Because of this, it is important to have enough submodels that the proportion of such discarded events is sufficiently small. However, as we increase the number of submodels, each one is responsible for shorter and shorter intervals, and so has fewer and fewer positive labels.
To address this label sparsity issue, we introduce a thermometer encoding of the label: instead of separating the label into non-overlapping buckets as above, we separate it into overlapping buckets, so that the label for each submodel is the number of events occurring from its beginning time to the end of the full time period. This way, almost regardless of how many submodels we use, each will still cover a reasonable time range and avoid label sparsity.
Using this label encoding also reduces the cost of inference. If each submodel is responsible for a narrow time bucket, we must evaluate the submodels covering all following time periods to complete the label or evaluate all submodels to get the full time range prediction. By using thermometer encoding, we need to evaluate only a single one.
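The difference between the two encodings can be illustrated with a short sketch. The start times `STARTS`, the window length `WINDOW`, and the function names are hypothetical values chosen for the example, and events are again represented by their delays in days.

```python
# Hypothetical submodel start times (days) and attribution window length.
STARTS = [0.0, 1.0, 7.0]
WINDOW = 30.0


def bucket_labels(event_delays):
    """Non-overlapping encoding: label i counts events in [STARTS[i], next start)."""
    ends = STARTS[1:] + [WINDOW]
    return [sum(1 for d in event_delays if s <= d < e)
            for s, e in zip(STARTS, ends)]


def thermometer_labels(event_delays):
    """Thermometer encoding: label i counts events in [STARTS[i], WINDOW)."""
    return [sum(1 for d in event_delays if s <= d < WINDOW) for s in STARTS]
```

Each thermometer label equals the sum of its own bucket label and all later ones, which is why a single submodel evaluation is enough to complete a label under this encoding.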
Auxiliary information
The third key idea is to use the delayed features as submodel inputs to improve the accuracy and bias of the label completion. In particular, we provide as an input to each submodel the “label so far”, i.e. a feature describing the labels observed before the start of that submodel's prediction interval. We can draw an analogy with conditional entropy: conditioning on a variable never increases entropy, and the reduction equals the mutual information between the two variables. Conditioning our prediction on the label so far will therefore reduce its entropy when the two have high mutual information, as shown in Figure 1.
In particular, note that providing this auxiliary information helps support the assumption that the delay distribution should be stationary in time conditional on the model inputs. For example, if user behaviors where the first event comes later become more frequent over time, the submodels could correctly adjust for this drift when performing label completion even if the change is not associated with the standard input features.
Observe also that any auxiliary information we use in training must also be available when performing inference on the submodel. This requirement works well with thermometer label encoding: we can provide each submodel any information available up to the start of its prediction interval, since we only use it for inference after that point in time. If, however, we wanted to provide auxiliary information to models using bucket encoding, we could only provide information available up to the start of the earliest bucket, since we may use every model to complete an “early” partial label.
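As a rough sketch of how the “label so far” could be attached to a submodel's inputs under thermometer encoding, consider the helper below. It assumes a mature example whose events are given as delays in days; the function name `submodel_example` and the feature key `label_so_far` are hypothetical.

```python
def submodel_example(serving_features, event_delays, start, window):
    """Build (features, label) for the submodel whose interval is [start, window).

    serving_features: dict of features available at serving time.
    event_delays: delays (days since click) of all events of a mature example.
    """
    # Auxiliary input: events seen strictly before the submodel's start time.
    # At inference this submodel is only used once the example is at least
    # `start` days old, so this count is always available.
    label_so_far = sum(1 for d in event_delays if d < start)
    # Thermometer label: events from the start time to the end of the window.
    label = sum(1 for d in event_delays if start <= d < window)
    features = dict(serving_features, label_so_far=label_so_far)
    return features, label
```

The original serving-time feature dictionary is copied rather than modified, so the same example can be reused to build inputs for every submodel.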
4.2 Setup details
Table 1 summarizes the exact training setup: the models that are trained, their features and labels, and the examples each one trains on. This model setup requires careful handling of several details.
First, in the training regime described, where examples are visited once sequentially, observe that for an example of a given age, only some of the submodel predictions will be trained. As a result, if all predictions come from a single model, predictions which are not being trained may be affected by those that are, and the model may experience catastrophic forgetting. To mitigate this effect in this training regime, it is desirable to have fully separate submodels for each of the predictions.
This then adds an extra constraint when deciding the number of buckets for the labels, and so the number of predictions. If we choose a low number of buckets, we may lose information, since when training, we discard label parts that occur partway through a bucket (and instead substitute a prediction for them). If we choose a larger number of buckets, the training cost of the overall model increases. In practice, we have found that 3-10 buckets can be a reasonable choice.
Model  Objective  Label  Examples to train  Features 

…  …  …  …  … 
5 Experimental Evaluation
In this section we will describe the exact setup used in the evaluation.
Dataset  The specific dataset on which we evaluate our solution is app install ads from a commercial mobile app store. Here, all examples are ad clicks and the label is the post-click events which happen in the app after the user installs it. For example, for a ride-share app, this could be the total number of rides that the user purchases in a specified time window after installing the app. Like Chapelle (2014), we assume last-click attribution, where any events are attributed to the user’s most recent ad click.
Features and models  We use several categorical features. We embed these categorical features in a dense vector space and pass the embeddings through a deep neural network with several fully connected layers.
Optimizer and loss function  We use the AdaGrad optimizer. We cast the regression problem as a Poisson regression (Cameron and Trivedi, 1998), which is one of the standard ways of predicting the expected value of count data, but our results should generalize to other regression formulations. We also tune all the hyperparameters to maximize likelihood when trained on mature labels. We use an ensemble of models for each variant to maintain reproducibility of results (Shamir and Coviello, 2020).
Online training  We use online training, which is popular at many internet companies (McMahan et al., 2013) and handles the nonstationarity of data well. In online training, we start training on the oldest examples and then train on examples in the order of their timestamps until we have completed one pass through the data. We may then continue training on new data as it becomes available as a result of new user interactions. To evaluate model performance during training, we first evaluate performance on each example and then train the model on that example. This gives us an estimate of how the model would have performed if we had chosen to use it in production at any given point in time.
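The evaluate-then-train loop can be sketched on a deliberately tiny model. The example below is an illustration only: it fits a single log-rate parameter with a scalar AdaGrad update instead of a neural network, and the names `poisson_nll` and `progressive_validation`, the learning rate, and the omission of the constant log(y!) term are all assumptions of the sketch.

```python
import math


def poisson_nll(pred, y):
    """Negative Poisson log-likelihood, omitting the constant log(y!) term."""
    return pred - y * math.log(pred)


def progressive_validation(labels, lr=0.5):
    """Visit labels in timestamp order: evaluate each example, then train on it."""
    theta = 0.0    # log of the predicted rate (the only model parameter here)
    grad_sq = 0.0  # AdaGrad accumulator of squared gradients
    losses = []
    for y in labels:
        pred = math.exp(theta)
        # Evaluate first, using parameters that have not yet seen this example.
        losses.append(poisson_nll(pred, y))
        # Then train: gradient of the loss with respect to theta is pred - y.
        grad = pred - y
        grad_sq += grad * grad
        if grad_sq > 0.0:
            theta -= lr * grad / math.sqrt(grad_sq)
    return theta, losses
```

Because each loss is recorded before the corresponding update, the sequence of losses approximates how the model would have performed if served at that point in time.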
5.1 Models compared
We will compare our new solution against several natural baselines. Along with this we also list several ablations that we conduct on different aspects of the design in our solution.
M1: Neglecting delay  In this model we train on all data and naively use the events or value available after a minimal delay (less than 1 day from click time) as the label.
M2: Train on different delays  This set of models trains on all data, with each model using a different delay from wall time, progressively seeing more of the label and less recent examples as the delay increases.
M3: Train only on mature data  Here we discard the most recent examples, for which the label is immature, and only train on examples for which the label is completely mature.
M4: Remove thermometer encoding  We train a model variant that does not use the proposed thermometer encoding technique. Instead, each submodel predicts the label in its own non-overlapping time bucket, and to obtain the final prediction, we sum the predictions of these submodels. Besides making the model more expensive to serve, we show that this also makes the predictions less accurate.
M5: Remove auxiliary information  In this variant we remove the auxiliary information we provide to the submodels (the label observed up to the submodel’s prediction period) and use only the features which are available at serving time.
Oracle: Train on complete labels  In this variant we train a Poisson regression model on complete labels, assuming no delay. This is impossible in practice, since the label will be incomplete until the attribution window has passed. It is used to represent an upper bound for this prediction task.
5.2 Results
While conversions aggregated over all advertisers may look approximately exponentially distributed, at a per-advertiser level they need not follow any particular class of distributions. Figure 3 shows the delay distribution of the data used for evaluation for a specific advertiser, with the y-axis indicating the number of conversions within the time bucket and the x-axis indicating the progression of time from the observed click.
We evaluate the different variants described above on the data, and compare performance along two dimensions: accuracy and bias. Accuracy is represented by the negative Poisson log-likelihood, or Poisson log loss, since the models considered here are Poisson models predicting conversions per click. The lower the negative Poisson log-likelihood, the more accurate the model variant is at predicting conversions compared to the true final label. The bias is defined as the model prediction divided by the true observed label once all the post-install conversions have arrived (which we define as “mature”). Note that the models train on data as it arrives (hence they see only a partially complete label, depending on the training delay), while during evaluation, we compare the predictions to the true mature label that is attributed to the click at the end of the attribution period. Along with evaluating the models on all data, we also consider slices for new advertising campaigns and for long-delay campaigns to show that the proposed model has better performance on partial data than the baselines. While we give numbers for the improvement in Poisson log loss, we give only qualitative results for bias due to the proprietary nature of the dataset. We note only that the bias of the proposed model indicates that it is calibrated.
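The two evaluation dimensions can be sketched as below. The aggregation choices are assumptions of this example rather than details given in the text: bias is computed as the ratio of summed predictions to summed mature labels, the log loss is averaged per example without the constant log(y!) term, and improvement is expressed relative to a baseline loss.

```python
import math


def prediction_bias(preds, mature_labels):
    """Ratio of total predicted to total observed mature label (1.0 is neutral)."""
    return sum(preds) / sum(mature_labels)


def mean_poisson_log_loss(preds, mature_labels):
    """Average negative Poisson log-likelihood, omitting the log(y!) constant."""
    total = sum(p - y * math.log(p) for p, y in zip(preds, mature_labels))
    return total / len(preds)


def relative_improvement(baseline_loss, model_loss):
    """Fractional reduction in loss relative to a baseline model."""
    return (baseline_loss - model_loss) / abs(baseline_loss)
```

A bias below 1.0 corresponds to underprediction, which is what a model trained on truncated labels would be expected to show.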
Figure 5 shows prediction bias on all data. As expected, we see that M1 has a significant negative bias, since it trains only on a fraction of the labels. While all other models have better bias than M1, the proposed model is closest to neutral bias. We can also see that all models behave the same when training over mature data, but as each model starts training on examples closer to the current time, their performance starts to diverge due to the sequential nature of training. The models without auxiliary features (M5) and without thermometer encoding (M4) start overestimating labels due to the absence of intermediate information, and have a positive bias. While the model that trains only on mature data (M3) performs closest to the proposed model on bias, its accuracy is, as expected, markedly worse, since it trains only on mature data (see Figure 6).
Figure 6 compares Poisson log loss on all data, where more accurate models have a lower Poisson log loss. The proposed model has the largest improvement in log loss with respect to the baseline (M3), and therefore has higher accuracy on the test data. The model variants without auxiliary features (M5) and without thermometer encoding (M4) have a higher Poisson log loss than the proposed model. The Poisson log loss improvements also show a direct correlation with the training delay, as expected: the smaller the training delay, the greater the improvement in accuracy, since the model is training on more recent data. While M4 and M5 have accuracy improvements comparable to the proposed model, they have a positive bias which is not seen in the proposed model.
The difference is more pronounced for high-delay campaigns, as shown in Figures 7 and 8. This is expected, since the proposed model is able to effectively model long-tail delay distributions and adjust the label to reflect incomplete data. Other models with shorter delays (M1, M2_7d) are much less accurate on these slices and have significantly worse calibration. The other similar variants (M4 and M5) are less accurate and have a positive bias due to the lack of intermediate information.
Another notable improvement of the proposed model is on new campaigns that are only a few days old, where the data is even more limited. By having auxiliary towers and features to predict delay distributions for these campaigns, the proposed model is able to effectively calibrate and adjust to them quickly while maintaining higher accuracy than the other model variants (Figures 9 and 10). The thermometer encoding is much more important here (as evidenced by the difference in Poisson log loss for new advertisers between M4 and Proposed in Table 2), since intermediate information and auxiliary features provide crucial signals required to make accurate predictions for new examples. This further supports the notion that the proposed model is robust to outliers and limited data compared to the other variants.
Table 2: Improvement in Poisson log loss relative to M3 (higher is better).
Model  All data  Long-delay advertisers  New advertisers 

M3  0.0%  0.0%  0.0% 
M1  6.6%  7.68%  0.32% 
M2_delay_7d  6.8%  7.97%  0.6% 
M2_delay_15d  5.9%  7.1%  1.13% 
M4  7.7%  9.13%  0.4% 
M5  7.92%  9.3%  1.7% 
Proposed  8.6%  10.16%  1.81% 
Oracle  9.1%  10.87%  2.0% 
6 Extensions
In this section we will note how our design is versatile enough to handle several modifications to the problem.
Value  In our experiments we evaluated the setting where we predict the expected number of post-click conversions. A simple variant of this problem that is very important in the industry is one in which each conversion can have a different value and the goal is to predict the expected total value of post-click conversions. It is straightforward to see that the whole design works with almost no changes and gives an unbiased estimator in this case.
Handling retractions and restatements  All previous solutions to handling conversion delay assume immutability of conversions which have already appeared. In practice, advertisers might want to retract or restate some subset of the conversions that they reported. This can be due to customers returning an item or a conversion having been found to be fraudulent. It is relatively straightforward to modify our solution to handle retractions and restatements while still obtaining an unbiased estimator. To do this, we need to split a conversion across different time buckets: a +1 in the bucket in which it happened and a -1 in the bucket in which it was retracted. This defines consistent random variables, but can make the label negative in some of the time buckets, which cannot be handled by Poisson regression. A simple fix for handling negative labels is to split the label in any time bucket into a positive portion and a negative portion, and to have separate outputs from the neural network to predict each of these.
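A minimal sketch of this positive/negative split is shown below. The bucket start times, the function name `signed_bucket_labels`, and the representation of each event as a pair of report delay and optional retraction delay are assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical bucket start times in days.
STARTS = [0.0, 1.0, 7.0]


def signed_bucket_labels(events):
    """Split bucket labels into non-negative positive and negative portions.

    events: list of (report_delay, retract_delay) pairs, where retract_delay
    is None if the conversion was never retracted.
    """
    positive = [0] * len(STARTS)
    negative = [0] * len(STARTS)
    for report_delay, retract_delay in events:
        # +1 in the bucket in which the conversion happened.
        positive[bisect_right(STARTS, report_delay) - 1] += 1
        if retract_delay is not None:
            # The -1 is tracked as +1 in the separate negative portion of
            # the bucket in which the retraction happened.
            negative[bisect_right(STARTS, retract_delay) - 1] += 1
    return positive, negative
```

Both returned lists are non-negative, so each can be fit with its own Poisson output, and the net label per bucket is recovered as the positive portion minus the negative portion.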
7 Conclusion
In this paper we introduced a way of handling delayed feedback in conversion optimizer models with many conversions per click. We showed experimentally that it performs better than several other solutions, and showed via ablation that all of the ideas introduced are necessary. We also showed that it is robust to outliers and limited data. Our solution is likely to be useful for problems in other domains with delayed feedback.
8 Acknowledgments
The authors would like to thank Samuel Ieong, Eu-Jin Goh, and Camille Wormser for their support and valuable inputs.
References
Agarwal et al. (2010). Estimating rates of rare events with multiple hierarchies through scalable log-linear models. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, New York, NY, USA, pp. 213–222. ISBN 9781450300551.
Cameron and Trivedi (1998). Regression analysis of count data. Cambridge University Press.
Chapelle (2014). Modeling delayed feedback in display advertising. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, August 24-27, 2014, pp. 1097–1105.
Choi et al. (2020). Delayed feedback model with negative binomial regression for multiple conversions. In Proceedings of the ADKDD’2020.
Elkan and Noto (2008). Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, Y. Li, B. Liu, and S. Sarawagi (Eds.), pp. 213–220.
Hubbard et al. (2019). Beta survival models. CoRR abs/1905.03818.
Ji et al. (2017). Time-aware conversion prediction. Front. Comput. Sci. 11 (4), pp. 702–716. ISSN 2095-2228.
Kato and Yasui (2020). Learning classifiers under delayed feedback with a time window assumption. CoRR abs/2009.13092.
Ktena et al. (2019). Addressing delayed feedback for continuous training with neural networks in CTR prediction. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, New York, NY, USA, pp. 187–195. ISBN 9781450362436.
Lee et al. (2012). Estimating conversion rate in display advertising from past performance data. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, August 12-16, 2012, Q. Yang, D. Agarwal, and J. Pei (Eds.), pp. 768–776.
Mann et al. (2019). Learning from delayed outcomes via proxies with applications to recommender systems. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 4324–4332.
McMahan et al. (2013). Ad click prediction: a view from the trenches. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013, I. S. Dhillon, Y. Koren, R. Ghani, T. E. Senator, P. Bradley, R. Parekh, J. He, R. L. Grossman, and R. Uthurusamy (Eds.), pp. 1222–1230.
Menon et al. (2011). Response prediction using collaborative filtering with hierarchies and side-information. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, C. Apté, J. Ghosh, and P. Smyth (Eds.), pp. 141–149.
Rosales et al. (2012). Post-click conversion modeling and analysis for non-guaranteed delivery display advertising. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, 2012, pp. 293–302.
Safari et al. (2017). Display advertising: estimating conversion probability efficiently. arXiv:1710.08583.
Saito et al. (2020). Dual learning algorithm for delayed conversions. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, J. Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock, J. Wen, and Y. Liu (Eds.), pp. 1849–1852.
Shamir and Coviello (2020). Anti-distillation: improving reproducibility of deep networks. arXiv:2010.09923.
Su et al. (2020). An attention-based model for conversion rate prediction with delayed feedback via post-click calibration. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, C. Bessiere (Ed.), pp. 3522–3528.
Vernade et al. (2017). Stochastic bandit models for delayed conversions. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017.
Yasui et al. (2020). A feedback shift correction in predicting conversion rates under delayed feedback. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, Y. Huang, I. King, T. Liu, and M. van Steen (Eds.), pp. 2740–2746.
Yoshikawa and Imai (2018). A nonparametric delayed feedback model for conversion rate prediction. CoRR abs/1802.00255.