Long-term utility of recommendations is an important indicator of recommendation quality (cf. Wu et al., 2017); hence optimizing for long-term value is desirable in many practical applications. For instance, an online book seller may want to maximize the sale of ebooks that customers finish reading (assuming that customers who engage with purchased content are likely to return to the store). Similarly, an online video subscription service may want to maximize the number of videos a customer finishes watching (to avoid recommending 'click-bait' content). Unfortunately, long-term value is only revealed after a significant delay, which is problematic when training a model to predict long-term value from historical data. The delay causes training-serving skew (Zinkevich, 2017), so the model's predictions for new products may be inaccurate, particularly in large-scale systems where new items arrive frequently. While generalization can help mitigate this problem, per-product memorization is important for large-scale recommender systems (Cheng et al., 2016).
Motivating Example: Consider an online marketplace for ebooks. We define an engagement event to occur when a customer finishes an ebook within 90 days of conversion. Similar to YouTube video recommendations (Davidson et al., 2010), when a customer visits, the marketplace (1) generates a small set of ebook candidates, (2) scores each of the candidates, and (3) presents them in descending order of predicted probability of engagement. When an ebook is purchased, whether a successful engagement will eventually occur remains unknown for up to 90 days. The approach taken by much of the prior work on learning with delay (Weinberger and Ordentlich, 2002; Mesterharm, 2005; Joulani et al., 2013; Quanrud and Khashabi, 2015) would imply waiting the full 90 days to observe the outcome. However, new books are added to the marketplace every day. If we had to wait 90 days before we could accurately predict the engagement probability of a new book, the marketplace would miss out on many potential sales. An alternative approach makes use of intermediate observations, which we denote by ISym (for intermediate symbol) throughout this paper. For example, one day after a purchase we can define ISyms based on the furthest page reached in the ebook. Clearly, these ISyms provide some information about the eventual outcome.
Intermediate Observations and Outcomes: To exploit ISyms, we need to formalize their relationship to outcomes. For example, ISyms based on independent, random noise obviously do not provide useful information about outcomes. The key assumption we use is that the outcome distribution can be factored into two models: (1) one that generates ISyms given an instance, and (2) one that generates outcomes given an ISym. An intuitive algorithm could try to learn both models. The first model can be updated as soon as an ISym is available, while the second model is slower to learn but generalizes across instances.
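As a toy illustration of this factored structure (with purely hypothetical symbol names and probabilities), the outcome probability for an instance can be computed by marginalizing over ISyms:

```python
# Toy illustration of the factored outcome model (all numbers hypothetical).
# p(s | x): probability of each ISym given the instance.
# q(y | s): probability of the outcome given the ISym (instance-independent).

# Instance "new_ebook": distribution over ISyms (furthest page after 1 day).
p_s_given_x = {"few_pages": 0.5, "half_book": 0.3, "most_pages": 0.2}

# Probability of eventual engagement given each ISym.
q_y_given_s = {"few_pages": 0.1, "half_book": 0.4, "most_pages": 0.9}

# Factored prediction: p(engage | x) = sum over s of q(engage | s) * p(s | x).
p_engage = sum(q_y_given_s[s] * p_s_given_x[s] for s in p_s_given_x)
print(round(p_engage, 6))  # 0.5*0.1 + 0.3*0.4 + 0.2*0.9 = 0.35
```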
The Problem: In the ebook marketplace, ebook candidates are not generated by a fixed distribution, since new books are added over time. Furthermore, a useful scoring algorithm should be agnostic to the candidate retrieval process. For these reasons, we model the problem as a semi-stochastic online learning problem with ISyms, where instances may be selected by an adversary but, given an instance, the outcome and the ISyms are sampled from a joint distribution. In our setting, if ISyms are ignored, a lower bound on the cumulative error is Ω(Nτ + √(NT)), where τ is the maximum outcome delay, N is the number of instances (e.g., total corpus size), and T is the performance horizon. Note that in a recommender system where we make 10,000 predictions per day and only update the forecaster each evening, a 90 day delay translates into τ = 900,000 rounds, and hence into a lower bound on cumulative error proportional to 900,000·N. We show that a much improved cumulative error of Õ(τ + √(T/σ) + √(MNT)) can be achieved by a predictor with a proper factored structure, where M is the number of possible ISyms, σ ∈ (0, 1] is a lower bound on the probability of observing each ISym (formalized in Assumption 2), and the tilde indicates that we have suppressed logarithmic factors. Examining these two bounds, we can see that ISyms help when the delay and number of instances are large and the number of ISyms is small.
Practical Deep Implementations: Using this intuition we propose a neural network-based online learner (FF) with two modules learning the above two models separately. However, we found that the factorization assumption is violated in both of our experimental domains. To mitigate this problem, we introduce a second neural network-based learner (RFF) that introduces a residual correction term. We compare both of these factored algorithms to an algorithm that ignores ISyms (DF) on two experimental domains: (1) predicting commit activity for GitHub repositories, and (2) predicting engagement with items acquired from a popular marketplace. In both of these domains, RFF outperforms both DF and FF.
Contribution: This paper offers three main contributions:
We formalize the problem of learning from delayed outcomes with ISyms as a semi-stochastic online learning problem. Many prior works have analyzed learning with delayed outcomes (e.g., Weinberger and Ordentlich, 2002; Mesterharm, 2005; Joulani et al., 2013); however, we believe this is the first work to consider a setting where ISyms are exploited to mitigate the impact of delay.
We quantify the potential gain from using ISyms under an assumption about the relationship between ISyms and outcomes. In particular, exploiting ISyms helps most when the outcome delay is large and the number of ISyms is small.
Finally, we introduce a practical neural network implementation, RFF, for exploiting ISyms. Our experiments provide evidence that RFF outperforms a delayed learner even when the assumptions required by our analysis are violated.
2 Formal Problem Description & Approach
Let X, Y, and S be finite, nonempty sets representing the instances, outcomes, and ISyms, respectively. We denote the number of instances by N = |X|, the number of labels by K = |Y|, the number of different ISyms by M = |S|, and the d-dimensional simplex by Δ_d for any d ≥ 1. The outcome distribution function is μ : X → Δ_{K−1}. Let T ≥ 1 be the number of prediction rounds and τ ≥ 0 the maximum number of steps that an outcome can be delayed (the delay is defined to be 0 if the label is received at the end of the round). At each round t, the environment generates a pair (x_t, y_t) such that x_t ∈ X is chosen by an adversary and y_t ∈ Y is sampled according to μ(x_t), independently in every round. (This deviates from the typical adversarial setting where both instances and labels are generated adversarially.) We will discuss how ISyms are generated below. Using the instance x_t, the forecaster makes a prediction ŷ_t ∈ Δ_{K−1} and incurs instantaneous loss ℓ(ŷ_t, y_t). At the end of round t, the forecaster receives a possibly empty set containing pairs (r, y_r), where r ≤ t indicates the round on which the label was generated. The goal of the forecaster is to minimize the cumulative error over T rounds.
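A minimal simulation of this online protocol can be written in a few lines; the instances, per-instance outcome probabilities, and delay distribution below are illustrative stand-ins, not part of the formal setting:

```python
import random
from collections import defaultdict

# Sketch of the online protocol: instances arrive adversarially, labels are
# revealed up to tau rounds later. (All names and distributions illustrative.)
def run_protocol(T=200, tau=5, seed=0):
    rng = random.Random(seed)
    instances = ["a", "b"]
    mu = {"a": 0.8, "b": 0.2}            # P(y = 1 | x), fixed per instance
    pending = defaultdict(list)           # reveal_round -> [(round, label)]
    revealed = []
    for t in range(T):
        x = instances[t % 2]              # adversary picks the instance
        y = 1 if rng.random() < mu[x] else 0
        delay = rng.randint(0, tau)       # label arrives after <= tau rounds
        pending[t + delay].append((t, y))
        revealed.extend(pending.pop(t, []))  # feedback revealed this round
    return revealed

feedback = run_protocol()
print(len(feedback))
```

Labels whose reveal round falls beyond the horizon are simply never observed, mirroring the delayed-feedback setting.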
Joulani et al. (2013) introduce a strategy for converting a base online learner into a delayed online learner, both for the adversarial and the stochastic setting (but not for the hybrid setting we consider). We refer to this approach, or any other approach that learns a direct relationship between instances and outcomes, as a direct forecaster for delayed outcomes. As mentioned in the introduction, a lower bound on the cumulative error in our setting when ISyms are ignored is Ω(Nτ + √(NT)). Thus, if we want to do better, we need to make an assumption about the relationship between how the outcomes and the ISyms are generated.
Assumption 1. At round t the environment generates (x_t, s_t, y_t); x_t is revealed at the beginning of the round, s_t is revealed at the end of the round (our experiments consider the case where intermediate observations are also delayed), and y_t is revealed after at most τ additional rounds. Given x_t, s_t is sampled from p*(·|x_t), and given s_t, y_t is sampled from q*(·|s_t), independently for all rounds.
Note that Assumption 1 implies that for all x ∈ X and y ∈ Y, the probability of outcome y given instance x is

μ(y|x) = Σ_{s∈S} q*(y|s)·p*(s|x). (1)

Equation (1) says that the outcome is generated by a stochastic process that can be factorized into two conditional probability distributions, p*(s|x) and q*(y|s). Since q*(y|s) does not depend on an instance, it can be estimated regardless of the sequence of instances chosen by the adversary. On the other hand, p*(s|x) does depend on an instance, but the ISyms are revealed by the environment much sooner than the labels, so an estimate of p*(s|x) can be updated more quickly than an estimate of μ learned directly from labels.
The formal setting in this paper is motivated by online marketplaces where we want to predict consumer engagement with purchased products. An instance represents features about the user and the content being considered for recommendation, while a label indicates what kind of engagement occurred, with the simplest being Y = {0, 1}, where 0 represents "no engagement" and 1 represents "successful engagement". While it is tempting to analyze this as a bandit problem, real online marketplaces often rank content based on multiple factors. Since multiple factors determine which items are recommended, there is rarely a predictable distribution over future instances. Thus, we allow instances to be chosen adversarially.
Proposed Approach: The basic idea is to exploit the factored model for the outcome distribution introduced in Assumption 1. Algorithm 1 learns empirical estimates of p*(s|x) and q*(y|s) and then uses the factorization (1) to make predictions. Since q*(y|s) does not depend on an instance x, the algorithm can improve its estimate of q* no matter what sequence of instances the adversary selects. Although p*(s|x) does depend on an instance, the ISym is revealed in the same round, so the agent can update its estimate of p* quickly when the adversary selects an instance that has not been observed in the past.
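A count-based sketch of this idea (a simplified stand-in for Algorithm 1, with hypothetical symbol and label names) maintains empirical estimates of the two conditional distributions and combines them at prediction time:

```python
from collections import Counter, defaultdict

class FactoredForecaster:
    """Count-based sketch of the factored approach: estimate the
    instance-to-ISym distribution from quickly available (instance, ISym)
    pairs and the ISym-to-label distribution from delayed (ISym, label)
    pairs, then predict via the factorization. A simplified illustration,
    not the paper's exact Algorithm 1."""

    def __init__(self, symbols, labels):
        self.symbols, self.labels = symbols, labels
        self.sx = defaultdict(Counter)  # counts of ISym s per instance x
        self.ys = defaultdict(Counter)  # counts of label y per ISym s

    def observe_isym(self, x, s):       # available after a short delay
        self.sx[x][s] += 1

    def observe_label(self, s, y):      # available after the long delay
        self.ys[s][y] += 1

    def predict(self, x):
        nx = sum(self.sx[x].values())
        prob = {}
        for y in self.labels:
            p = 0.0
            for s in self.symbols:
                # Fall back to uniform estimates when no counts exist yet.
                p_s = self.sx[x][s] / nx if nx else 1.0 / len(self.symbols)
                ns = sum(self.ys[s].values())
                q_y = self.ys[s][y] / ns if ns else 1.0 / len(self.labels)
                p += q_y * p_s
            prob[y] = p
        return prob

f = FactoredForecaster(symbols=["low", "high"], labels=[0, 1])
f.observe_isym("book1", "high"); f.observe_label("high", 1)
f.observe_isym("book1", "low");  f.observe_label("low", 0)
print(f.predict("book1"))  # {0: 0.5, 1: 0.5}
```

Because the label counts are keyed only by ISym, feedback gathered on old instances immediately improves predictions for brand-new instances.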
Before we analyze Algorithm 1, we need one additional assumption.
Assumption 2. Let σ ∈ (0, 1] and let D be a set of distributions over instances such that Σ_{x∈X} d(x)·p*(s|x) ≥ σ for every d ∈ D and every s ∈ S. At each round t, the adversary first chooses d_t ∈ D and then selects the instance x_t ~ d_t, independently at all rounds.
Assumption 2 is needed to ensure that we can estimate q* quickly. If σ could be 0, the adversary might select instances under which some ISym never occurs and then, towards the end of the episode, start introducing instances where that observation is very probable, forcing any algorithm to wait for the delayed labels rather than exploiting the factorization (1).
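Under our reading of Assumption 2 (every ISym must occur with some minimum probability under any instance distribution the adversary may use), that minimum can be checked for a candidate distribution as follows; all names and numbers are hypothetical:

```python
# Checking the Assumption 2 quantity for a candidate instance distribution d:
# sigma is the smallest probability of observing any ISym when x ~ d.
# (Hypothetical factored model; p_s_given_x would come from the environment.)
p_s_given_x = {
    "x1": {"s1": 0.7, "s2": 0.3},
    "x2": {"s1": 0.1, "s2": 0.9},
}
d = {"x1": 0.5, "x2": 0.5}  # adversary's distribution over instances

sigma = min(
    sum(d[x] * p_s_given_x[x][s] for x in d)
    for s in ["s1", "s2"]
)
print(round(sigma, 6))  # min(0.4, 0.6) = 0.4
```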
While the analysis is straightforward, Theorem 1 provides valuable insight about when learning from intermediate observations is helpful. The proof can be found in Section A of the supplementary material.
Before discussing Theorem 1, note that if intermediate observations are ignored, a lower bound of Ω(Nτ + √(NT)) can be derived. It is well known that estimating the mean of a Bernoulli random variable from n samples results in an Ω(1/√n) error (see, e.g., Devroye et al. (1996)). If the outcomes are delayed by τ rounds, at least a constant positive loss is suffered on average in the first τ rounds (before any observation is received). This implies that in the delayed case, a lower bound on the cumulative error is Ω(τ + √T) for a single Bernoulli distribution. Now consider the case where we have N different instances, each repeated T/N times. In each of these segments the minimum loss suffered is Ω(τ + √(T/N)), and so the cumulative loss is at least Ω(Nτ + √(NT)).
Compared to the lower bound that ignores intermediate observations, the bound of Theorem 1 scales more favorably when the delay and the number of instances are large, since its delay term depends on τ rather than on Nτ. However, we pay an additional price for learning with ISyms. If σ is very small (meaning that there is at least one ISym that is unlikely to be observed), it can take many rounds to learn a good approximation of q*(·|s) for each ISym s. Furthermore, we also pay a price for introducing a large number of ISyms.
2.1 Neural Network Architectures for Learning from Delayed Outcomes
Based on our analysis, we propose three neural network architectures for learning from delayed outcomes with intermediate observations.
Direct forecaster (DF): We use a single neural network to predict the distribution over outcomes given an instance (Figure 1(a)). This approach ignores ISyms.
Factored forecaster (FF): This approach learns two neural networks (Figure 1(b) and 1(c)). The first (Figure 1(b)) predicts a distribution over ISyms given an instance, while the second (Figure 1(c)) predicts an outcome distribution given an ISym. This approach is similar to Algorithm 1.
Residual factored forecaster (RFF): This approach is similar to FF (Figure 1(d)). However, the neural network that predicts an outcome distribution has two towers. The first tower only uses the ISym to predict the logits, while the second tower is an instance-dependent residual correction that can help to correct predictions when Assumption 1 does not hold exactly. Furthermore, we train the residual tower with a separate loss and stop backpropagation from that loss to the first tower. This ensures that the second tower is treated as a residual correction, helping to preserve generalization across instances.
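The output head of RFF can be sketched as a forward pass in which the two towers' logits are summed before the softmax; the tower outputs below are hypothetical placeholders, and the stop-gradient applied during training is indicated only in a comment:

```python
import math

# Minimal sketch of the RFF output head: the final logits are the sum of an
# ISym-only tower and an instance-dependent residual tower. (Forward pass
# only; the logit values are hypothetical placeholders.)
def softmax(logits):
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def rff_forward(isym_logits, residual_logits):
    # During training, stop_gradient would be applied to isym_logits inside
    # the residual loss, so that loss cannot alter the shared ISym tower.
    combined = [a + b for a, b in zip(isym_logits, residual_logits)]
    return softmax(combined)

probs = rff_forward(isym_logits=[2.0, 0.0], residual_logits=[-0.5, 0.5])
print(round(probs[0], 4))  # logits [1.5, 0.5] -> softmax gives 0.7311
```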
For all experiments, and unless stated otherwise, we update network weights using Stochastic Gradient Descent minimizing the log-loss. (We tried other optimizers but found that most were not sensitive enough to changes in the distribution over instances, as they keep track of a historical average over gradients.) Except for the networks predicting the outcome distribution from ISyms (Figure 1(c) and the left tower in Figure 1(d)), all network towers have two hidden layers. Their output layers are sized appropriately (to output the correct number of class logits) and use a softmax activation. We apply regularization to the weights. The networks predicting the outcome distribution from ISyms contain no hidden layers, and they use neither regularization nor a bias in their output layer. Training samples are stored in a fixed-size FIFO replay buffer from which we sample uniformly.
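The replay mechanism described above amounts to a fixed-size FIFO buffer with uniform sampling; a minimal sketch (capacity and batch size illustrative):

```python
import random
from collections import deque

# Fixed-size FIFO replay buffer with uniform sampling, as described above.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest items evicted first

    def add(self, example):
        self.buffer.append(example)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=1000)
for i in range(1500):
    buf.add(i)
batch = buf.sample(128)
print(len(batch), min(buf.buffer))  # buffer keeps the most recent 1000: 128 500
```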
For the experiment described in Section 3.1, the networks predicting the outcome distribution from ISyms use a separate learning rate. Network towers have two hidden layers with 40 and 20 units. The training buffer has a size of 1,000. We start training once we have 128 examples in the buffer and perform one gradient step with a batch size of 128 every four rounds.
For the experiment described in Section 3.2, the networks predicting the outcome distribution from ISyms use a separate learning rate. Network towers have two hidden layers with 20 and 10 units. The training buffer has a size of 3,000. We start training once we have 500 examples in the buffer and perform 20 gradient steps with a batch size of 128 every 1,000 rounds (to simulate less frequent updates to the networks). The parameters were tuned using grid search.
3 Experiments & Results
We compare cumulative error of DF, FF, and RFF in two domains: (1) predicting the commit activity of GitHub repositories, and (2) predicting engagement with items acquired from a marketplace.
3.1 GitHub Commit Activity
The goal is to predict the number of commits made to repositories from GitHub (http://www.github.com), a popular website hosting open-source software projects. Given a repository, the question we want the online learner to answer is: "will there be at least three commits in the next three weeks?". This information could be used to predict churn rate; for example, GitHub could potentially intervene by sending a reminder email. We obtained historical information about commits to GitHub repositories from the BigQuery GitHub database (https://cloud.google.com/bigquery/public-data/github). We started with 100,000 repositories and filtered out repositories with fewer than five unique days with commits between May 1, 2017 and January 8, 2018. This resulted in about 8,300 repositories. In our experiments, an adversary selects both a repository and a timestamp. The outcome is one if there were at least three commits over the 21 days following the chosen timestamp and zero otherwise. The ISym is based on the number of commits over the first seven days following the chosen timestamp, mapped to three values: (1) no commits, (2) one commit, and (3) more than one commit. The ISym is delayed by one week and the outcomes are delayed by three weeks. The forecaster receives the equivalent of one sample every 10 minutes.
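The label and ISym construction for this task reduces to two small mappings (a sketch; the function and symbol names are ours):

```python
# Mapping raw commit counts to the ISyms and outcome described above.
def isym_from_week1(commits_first_week):
    if commits_first_week == 0:
        return "no_commits"
    if commits_first_week == 1:
        return "one_commit"
    return "multiple_commits"

def outcome(commits_21_days):
    # 1 if there were at least three commits in the 21-day window.
    return 1 if commits_21_days >= 3 else 0

print(isym_from_week1(2), outcome(4))  # multiple_commits 1
```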
The adversary initially selects repositories from a subset with a low number of commits (one or fewer commits over the past month). After four and a half weeks, the adversary switches to a distribution that samples from repositories with two or more commits within the past week. Figure 2 shows the outcome probabilities for each ISym under the two distributions used by the adversary. Since the outcome probabilities given an ISym are not equal under the two distributions, Assumption 1 is violated. This makes the task more difficult for the factored architectures but does not matter for the direct architecture. In this experiment, the historical information about a repository defines an instance. We used binary features to represent the programming languages present in a repository, as well as time-bucketized counts of historical commit activity for that repository.
To generate the cumulative error, we subtract the loss of an optimal forecaster. Since we do not have access to an optimal forecaster, we trained two models with the same architecture as DF, one on each of the two modes used by the adversary. We trained these models using the Adam optimizer for 10,000 steps with an initial learning rate of 0.0005.
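The excess log-loss used for these curves can be sketched as the difference between the forecaster's log-loss and the reference model's log-loss on the same outcome:

```python
import math

# Excess log-loss of a forecaster relative to an (approximately) optimal
# reference model, as used for the cumulative-error curves.
def log_loss(p, y):
    eps = 1e-12  # guard against log(0)
    p = min(max(p, eps), 1 - eps)
    return -math.log(p) if y == 1 else -math.log(1 - p)

def excess_loss(p_model, p_optimal, y):
    return log_loss(p_model, y) - log_loss(p_optimal, y)

print(round(excess_loss(0.5, 0.9, 1), 4))  # -ln(0.5) + ln(0.9) = 0.5878
```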
Figure 3 compares the loss (for all experiments, we show the excess log-loss with respect to an optimal forecaster, since we train on that loss; results are similar when using the L1-loss instead) of the direct and factored architectures averaged over 200 independent trials. The vertical dashed line indicates the time at which the adversary switches from its initial distribution to a distribution over high-commit repositories. Due to the outcome delay, all algorithms suffer the same initial loss for roughly 3 weeks. Then all three algorithms quickly achieve a low loss until the adversary changes the distribution over repositories. Finally, the factored architectures (FF and RFF) recover roughly 2.5 weeks more quickly than DF (as can be seen in Figure 3 from week 5.5 to 8). Figure 3 also shows the cumulative error for the same experiment. FF and RFF achieve smaller cumulative error as they are better able to adapt to changing distributions. We can also clearly observe that FF is not able to maintain a loss as low as that of DF and RFF, as it is unable to cope with the slightly incorrect factorization assumption.
(Figure 3 caption: loss, further averaged over five consecutive data points, in a prediction task where the adversary shifts the distribution of instances partway through the episode, indicated by the dashed vertical line. We measure the excess loss with respect to an optimally trained model. The shaded area represents 95% confidence intervals.)
3.2 Engagement with Marketplace Items
For a popular marketplace with personalized recommendations, we want to predict the probability that, after acquiring an item, a user will (1) engage with that item more than once, and (2) not delete that item within 7 days of acquiring it (e.g., a user who downloads a new ebook will read at least two chapters and not delete the ebook from their device). The ISyms are measured two days after an item is acquired and take one of four values: (1) Deleted: the item was deleted; (2) Zero Engagements: the user engaged with the item zero times but did not delete it; (3) One Engagement: the user engaged with the item exactly once and did not delete it; (4) Many Engagements: the user engaged with the item two or more times and did not delete it. The outcomes are the same as the ISyms but measured seven days after conversion.
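The four classes can be derived from raw activity with a small mapping (a sketch; the field and class names are ours):

```python
# Mapping raw user activity to the four classes described above.
def engagement_class(deleted, num_engagements):
    if deleted:
        return "deleted"
    if num_engagements == 0:
        return "zero_engagements"
    if num_engagements == 1:
        return "one_engagement"
    return "many_engagements"

# The ISym applies this mapping to the first 2 days of activity;
# the outcome applies the same mapping to the first 7 days.
print(engagement_class(deleted=False, num_engagements=3))  # many_engagements
```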
We collected 21 days of data between the 10th and 31st of January 2018. For each item with at least 100 conversions, we stored the empirical probability of each ISym and the probability of each outcome conditioned on ISyms. In addition, we stored the empirical probability that an item was acquired on each of the 21 days. We consider two distribution schedules (Figure 4). The first schedule, which we refer to as Staggered (Figure 4(a)), subsamples 100 items, where each item enters the schedule (at noon) on the day it first appeared in the marketplace (i.e., candidate items with higher indices only appear on increasingly later days). This schedule simulates items continually being added to the marketplace. It is a very hard setting, in which we expect DF to perform poorly due to the need to constantly make predictions about new instances. The second schedule, which we refer to as Uniform (Figure 4(b)), is derived by subsampling 100 items uniformly and creating a 21-day schedule with half-day intervals by interpolating between the empirical distributions for each item based on the frequencies from the 21 logged days. This schedule is more favorable for DF because instances are sampled according to a slowly shifting distribution. Overall, the ISyms are delayed by two days and the outcomes are delayed by seven days. The forecaster receives the equivalent of one sample every 40 seconds.
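A simplified generator for the Staggered schedule (our illustrative reading, in which roughly five new items become available each day) might look like:

```python
import random

# Illustrative generator for the Staggered schedule: item i first becomes
# available on day floor(i / items_per_day), simulating a marketplace where
# new items continually appear. (Simplified from the description above.)
def staggered_sample(day, num_items=100, num_days=21, rng=random):
    per_day = num_items / num_days
    available = [i for i in range(num_items) if i // per_day <= day]
    return rng.choice(available)

samples_day0 = {staggered_sample(0) for _ in range(200)}
print(max(samples_day0) <= 4)  # True: only the first few items exist on day 0
```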
In this experiment, we encoded an item/instance using a 100-dimensional one-hot encoding specifying items by index. Similarly to the GitHub experiment, to generate the cumulative error we subtract the loss of an optimal forecaster. We trained 42 models (one for each half-day interval) with the same architecture as DF on all modes used by the adversary. We trained these models using the Adam optimizer for 40,000 steps with an initial learning rate of 0.0005.
Figure 5 compares the average cumulative error of all three forecasters over 200 independent trials for both adversarial schedules. For the Staggered schedule (Figure 5), FF and RFF outperform DF, as expected. DF must wait until the outcomes become available, but FF and RFF are able to generalize to new instances. RFF performs slightly better than FF, indicating that RFF can mitigate the incorrect factorization assumption as delayed outcomes become available.
The Uniform schedule (Figure 5) is easier since the distribution over instances is shifting slowly. DF, FF, and RFF all achieve smaller cumulative error compared to the Staggered schedule. In the Uniform schedule, DF outperforms FF because the factorization assumption is violated by the data. However, RFF achieves similar results to DF because its residual tower allows it to mitigate the error introduced by the incorrect factorization assumption.
4 Related Work
Chapelle (2014) proposes a model for learning from delayed conversions. However, this approach does not take advantage of potential intermediate observations. Learning from delayed labels is also related to survival analysis (Yu et al., 2011; Fernández et al., 2016), where the goal is to model the time until a delayed event. A significant difference of our work is the use of intermediate observations.
A large body of literature exists in online learning, for both the adversarial and stochastic settings as well as partial monitoring, analyzing the regret under delayed feedback (see, e.g., Weinberger and Ordentlich (2002); Mesterharm (2005); Agarwal and Duchi (2011); Joulani et al. (2013, 2016)). However, none of these settings takes intermediate observations into consideration: for each prediction, either the feedback for the prediction made at round t has been revealed or it has not. In our setting, each prediction round is associated with an intermediate observation symbol that is revealed sooner than the label.
In a stochastic online learning setting, where the instances and labels are sampled from the same distribution at each round, the regret (Joulani et al., 2013) or mistake bounds (Mesterharm, 2005) scale with τ + N (i.e., delay plus number of instances; the exact dependence on the time horizon T depends on the loss function). However, the stochastic assumption is not realistic for online marketplaces because new products are added on a regular basis. Thus, the distribution over instances is not independent and identically distributed from day to day.
5 Conclusion
We present a way to leverage intermediate observations to learn faster in scenarios where the long-term labels are delayed. Our theoretical analysis shows that the cumulative error of the factored approach, which exploits intermediate observations, scales as Õ(τ + √(T/σ) + √(MNT)), unlike a naive approach that scales as Ω(Nτ + √(NT)). We present experimental results on a dataset from GitHub as well as a dataset from a real marketplace, and show that our algorithms can learn faster when intermediate observations are helpful, and can gracefully recover the baseline performance when intermediate observations are unhelpful. We believe that the proposed approach can be beneficial in many real-world applications where the goal is to optimize for long-term value. It would be interesting to extend our theoretical analysis to ranking measures.
We would like to thank David Silver for helpful discussions regarding learning from delayed signals and Tor Lattimore for reviewing this manuscript.
- Agarwal and Duchi  Alekh Agarwal and John Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems 24 (NIPS), pages 873–881, 2011.
- Angluin and Valiant  D. Angluin and L.G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18(2):155–193, 1979.
- Chapelle  Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 1097–1105, 2014. ISBN 978-1-4503-2956-9.
- Cheng et al.  Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10, 2016.
- Davidson et al.  James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, pages 293–296. ACM, 2010.
- Devroye et al.  L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer-Verlag New York, 1996.
- Fernández et al.  Tamara Fernández, Nicolás Rivera, and Yee Whye Teh. Gaussian processes for survival analysis. In Advances in Neural Information Processing Systems, pages 5021–5029, 2016.
- Joulani et al.  Pooria Joulani, András György, and Csaba Szepesvári. Online learning under delayed feedback. In Proceedings of the International Conference on Machine Learning, 2013.
- Joulani et al.  Pooria Joulani, András György, and Csaba Szepesvári. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In Proceedings of the 30th Conference on Artificial Intelligence (AAAI-16), 2016.
- Mesterharm  Chris Mesterharm. On-line learning with delayed label feedback. In International Conference on Algorithmic Learning Theory, pages 399–413. Springer, 2005.
- Quanrud and Khashabi  Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1270–1278. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5833-online-learning-with-adversarial-delays.pdf.
- Weinberger and Ordentlich  Marcelo J Weinberger and Erik Ordentlich. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959–1976, 2002.
- Wu et al.  Qingyun Wu, Hongning Wang, Liangjie Hong, and Yue Shi. Returning is believing: Optimizing long-term user engagement in recommender systems. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1927–1936, 2017.
- Yu et al.  Chun-Nam Yu, Russell Greiner, Hsiu-Chin Lin, and Vickie Baracos. Learning patient-specific cancer survival distributions as a sequence of dependent regressors. In Advances in Neural Information Processing Systems, pages 1845–1853, 2011.
- Zinkevich  Martin Zinkevich. Rules of machine learning: Best practices for ML engineering, 2017.
Appendix A Proof of Theorem 1
Let C_t(s, y) denote the number of times we observe an ISym–label pair (s, y) by the end of round t, and C_t(s) the number of times we observe the ISym s. Note that C_t(s) ≥ Σ_{y∈Y} C_t(s, y). Furthermore, let D_t(x, s) and D_t(x) denote the number of times we observe an instance–ISym pair (x, s) and, respectively, an instance x by the end of round t (note that D_t(x) = Σ_{s∈S} D_t(x, s)).
Define the estimators q̂_t(y|s) = C_{t−τ}(s, y)/C_{t−τ}(s) and p̂_t(s|x) = D_{t−1}(x, s)/D_{t−1}(x); note that C_{t−τ}(s) is the number of times we have observed labels for the ISym s before the t-th prediction is made. Let μ̂_t(y|x) = Σ_{s∈S} q̂_t(y|s)·p̂_t(s|x) (so that μ̂_t estimates μ).
By the union bound and the Hoeffding–Azuma inequality (Devroye et al., 1996), we can obtain concentration bounds for q̂_t and p̂_t. Let E denote the event that

|q̂_t(y|s) − q*(y|s)| ≤ √(log(2MKT/δ)/(2·C_{t−τ}(s)))  and  Σ_{s∈S} |p̂_t(s|x) − p*(s|x)| ≤ √(2M·log(2NT/δ)/D_{t−1}(x))

hold simultaneously for all t ≤ T, s ∈ S, x ∈ X, and y ∈ Y, where the right-hand sides are defined to be infinity when the corresponding counts in the denominators (C_{t−τ}(s) and, respectively, D_{t−1}(x)) are zero. Then E holds with probability at least 1 − 2δ.
The error of the estimate at time t for any x and y can be bounded as

|μ̂_t(y|x) − μ(y|x)| ≤ Σ_{s∈S} q̂_t(y|s)·|p̂_t(s|x) − p*(s|x)| + Σ_{s∈S} |q̂_t(y|s) − q*(y|s)|·p*(s|x).

Given E, if D_{t−1}(x) > 0, the first term can be bounded by √(2M·log(2NT/δ)/D_{t−1}(x)). Furthermore, taking into account that Σ_{s∈S} p*(s|x) = 1, the second term is bounded by max_{s∈S} √(log(2MKT/δ)/(2·C_{t−τ}(s))) as long as C_{t−τ}(s) > 0 for all s ∈ S.
Bounding the first expression for all t is simple, as we get the ISyms immediately after making a prediction. For any fixed instance x, we can use the concentration bounds for our estimates from the second time x is observed. Thus, given E, the first term at round t is at most √(2M·log(2NT/δ)/D_{t−1}(x_t)). Summing up for all t and using Jensen's inequality with the concavity of the square root function and Σ_{x∈X} D_T(x) ≤ T, we get a total contribution of Õ(√(MNT)) from the first term.
To handle the second term, we use Assumption 2. Fix s ∈ S. By the assumption, there exists a sequence of independent and identically distributed Bernoulli random variables B_1, …, B_T, coupled to the ISym observations, such that s_t = s whenever B_t = 1 and P(B_t = 1) = σ. Then C_{t−τ}(s) ≥ Σ_{r ≤ t−τ} B_r for t > τ. Furthermore, by the multiplicative Chernoff bound (Angluin and Valiant, 1979), for any t > τ, C_{t−τ}(s) ≥ (t − τ)σ/2 with probability at least 1 − e^{−(t−τ)σ/8}. Defining t_0 = τ + (8/σ)·log(MT/δ) (cf. Lemma 1 below), e^{−(t−τ)σ/8} ≤ δ/(MT) is satisfied for all t ≥ t_0. Therefore, with probability at least 1 − δ, C_{t−τ}(s) ≥ (t − τ)σ/2 simultaneously for all s ∈ S and t ≥ t_0, giving max_{s∈S} √(log(2MKT/δ)/(2·C_{t−τ}(s))) = Õ(1/√((t − τ)σ)) for all t ≥ t_0. Also assuming that E holds, we see that, with probability at least 1 − 3δ, summing the second term over all T rounds contributes at most Õ(τ + √(T/σ)) (assuming T ≥ 1/σ).
Setting δ = 1/T, the expected cumulative loss can be bounded by combining the two contributions above, yielding the Õ(τ + √(T/σ) + √(MNT)) bound of Theorem 1.
Lemma 1. Let a > 0 and b be real. Then a·log t + b ≤ t if t ≥ 2a·log(2a) + 2b.

Proof. By the concavity of the logarithm, using a first-order Taylor expansion at t_0 = 2a, for any t > 0, log t ≤ log t_0 + (t − t_0)/t_0. Hence a·log t + b ≤ a·log(2a) + (t − 2a)/2 + b. Now the statement follows by solving a·log(2a) + (t − 2a)/2 + b ≤ t for t.