In this paper we study sequential learning when the feedback about the predictions made by the forecaster is delayed. This is the case, for example, in web advertisement, where the information on whether a user has clicked on a certain ad may come back to the engine in a delayed fashion: after an ad is selected, while waiting to learn whether the user clicks or not, the engine has to provide ads to other users. Also, the click information may be aggregated and then periodically sent to the module that decides about the ads, resulting in further delays (Li et al., 2010; Dudik et al., 2011). Another example is parallel, distributed learning, where propagating information among nodes causes delays (Agarwal & Duchi, 2011).
While online learning has proved to be successful in many machine learning problems and is applied in practice in situations where the feedback is delayed, the theoretical results for the non-delayed setup are not applicable when delays are present. Previous work concerning the delayed setting focussed on specific online learning settings and delay models (mostly with constant delays). Thus, a comprehensive understanding of the effects of delays is missing. In this paper, we provide a systematic study of online learning problems with delayed feedback. We consider the partial monitoring setting, which covers all settings previously considered in the literature, extending, unifying, and often improving upon existing results. In particular, we give general meta-algorithms that transform, in a black-box fashion, algorithms developed for the non-delayed case into algorithms that can handle delays efficiently. We analyze how the delay affects the regret of the algorithms. One interesting, perhaps somewhat surprising, result is that the delay inflates the regret in a multiplicative way in adversarial problems, while this effect is only additive in stochastic problems. While our general meta-algorithms are useful, their time- and space-complexity may be unnecessarily large. To resolve this problem, we work out modifications of variants of the UCB algorithm (Auer et al., 2002) for stochastic bandit problems with delayed feedback that have much smaller complexity than the black-box algorithms.
The rest of the paper is organized as follows. The problem of online learning with delayed feedback is defined in Section 2. The adversarial and stochastic problems are analyzed in Sections 3.1 and 3.2, while the modification of the UCB algorithm is given in Section 4. Some proofs, as well as results about the KL-UCB algorithm (Garivier & Cappé, 2011) under delayed feedback, are provided in the appendix.
2 The delayed feedback model
We consider a general model of online learning, which we call the partial monitoring problem with side information. In this model, the forecaster (decision maker) has to make a sequence of predictions (actions), possibly based on some side information, and for each prediction it receives some reward and feedback, where the feedback is delayed. More formally, given a set of possible side information values , a set of possible predictions , a set of reward functions , and a set of possible feedback values , at each time instant , the forecaster receives some side information ; then, possibly based on the side information, the forecaster predicts some value while the environment simultaneously chooses a reward function ; finally, the forecaster receives reward and some time-stamped feedback set . In particular, each element of is a pair of time index and a feedback value, the time index indicating the time instant whose decision the associated feedback corresponds to.
Note that the forecaster may or may not receive any direct information about the rewards it receives (i.e., the rewards may be hidden). In standard online learning, the feedback-set is a singleton and the feedback in this set depends on . In the delayed model, however, the feedback that concerns the decision at time is received at the end of the time period , after the prediction is made, i.e., it is delayed by time steps. Note that corresponds to the non-delayed case. Due to the delays multiple feedbacks may arrive at the same time, hence the definition of .
The goal of the forecaster is to maximize its cumulative reward . The performance of the forecaster is measured relative to the best static strategy selected from some set in hindsight. In particular, the forecaster’s performance is measured through the regret, defined by
A forecaster is consistent if it achieves, asymptotically, the average reward of the best static strategy, that is , and we are interested in how fast the average regret can be made to converge to .
The above general problem formulation includes most scenarios considered in online learning. In the full information case, the feedback is the reward function itself, that is, and (in the non-delayed case). In the bandit case, the forecaster only learns the rewards of its own prediction, i.e., and . In the partial monitoring case, the forecaster is given a reward function and a feedback function , where is a set of choices (outcomes) of the environment. Then, for each time instant the environment picks an outcome , and the reward becomes , while . This interaction protocol is shown in Figure 1 in the delayed case. Note that the bandit and full information problems can also be treated as special partial monitoring problems. Therefore, we will use this last formulation of the problem. When no stochastic assumption is made on how the sequence is generated, we talk about the adversarial model. In the stochastic setting we will consider the case when
is a sequence of independent, identically distributed (i.i.d.) random variables. Side information may or may not be present in a real problem; in its absence, the set of side information values is a singleton.
Finally, we may have different assumptions on the delays. Most often, we will assume that is an i.i.d. sequence, which is independent of the past predictions of the forecaster. In the stochastic setting, we also allow the distribution of to depend on .
Note that the delays may change the order of observing the feedbacks, with the feedback of a more recent prediction being observed before the feedback of an earlier one.
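To make the arrival model concrete, the following minimal Python sketch groups feedback by the time step at which it becomes available. It assumes, as described above, that the feedback for the decision made at time t arrives at the end of time step t plus its delay; the function name `feedback_arrivals` and the list-based encoding of the delays are illustrative choices, not notation from the paper.

```python
from collections import defaultdict


def feedback_arrivals(delays):
    """Group decision times by the time step at which their feedback arrives.

    delays[t - 1] is the (assumed, hypothetical) delay of the feedback for the
    decision made at time t = 1, 2, ...; that feedback becomes available at
    the end of time step t + delays[t - 1].
    """
    arrivals = defaultdict(list)
    for t, tau in enumerate(delays, start=1):
        arrivals[t + tau].append(t)
    return dict(arrivals)


# Decision 1 is delayed by 3 steps, decision 2 is not delayed, and so on.
arrivals = feedback_arrivals([3, 0, 1, 0])
print(arrivals)  # {4: [1, 3, 4], 2: [2]}
```

In this toy run the feedback of decision 2 is observed before that of decision 1, and three feedbacks arrive together at the end of step 4, which illustrates both the reordering and the need for a feedback set rather than a single feedback value.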
2.1 Related work
The effect of delayed feedback has been studied in recent years under different online learning scenarios and different assumptions on the delay. A concise summary, together with the contributions of this paper, is given in Table 1.
|Stochastic Feedback||General (Adversarial) Feedback|
|Side||(Agarwal & Duchi, 2011)||(Weinberger & Ordentlich, 2002)|
|Full Info||Info||(Langford et al., 2009)|
|(Agarwal & Duchi, 2011)|
|(Mesterharm, 2007)||(Mesterharm, 2007)|
|Bandit||Info||(Desautels et al., 2012)||(Neu et al., 2010)|
|(Dudik et al., 2011)|
To the best of our knowledge, Weinberger & Ordentlich (2002) were the first to analyze the delayed feedback problem; they considered the adversarial full information setting with a fixed, known delay . They showed that the minimax optimal solution is to run independent optimal predictors on the subsampled reward sequences: prediction strategies are used such that the predictor is used at time instants with . This approach forms the basis of our method devised for the adversarial case (see Section 3.1). Langford et al. (2009) showed that under the usual conditions, a sufficiently slowed-down version of the mirror descent algorithm achieves optimal decay rate of the average regret. Mesterharm (2005, 2007) considered another variant of the full information setting, using an adversarial model on the delays in the label prediction setting, where the forecaster has to predict the label corresponding to a side information vector. While in the full information online prediction problem Weinberger & Ordentlich (2002) showed that the regret increases by a multiplicative factor of , in the work of Mesterharm (2005, 2007) the important quantity becomes the maximum/average gap defined as the length of the largest time interval the forecaster does not receive feedback. Mesterharm (2005, 2007) also shows that the minimax regret in the adversarial case increases multiplicatively by the average gap, while it increases only in an additive fashion in the stochastic case, by the maximum gap. Agarwal & Duchi (2011) considered the problem of online stochastic optimization and showed that, for i.i.d. random delays, the regret increases with an additive factor of order .
Qualitatively similar results were obtained in the bandit setting. Considering a fixed and known delay , Dudik et al. (2011) showed an additive penalty in the regret for the stochastic setting (with side information), while Neu et al. (2010) showed a multiplicative regret for the adversarial bandit case. The problem of delayed feedback has also been studied for Gaussian process bandit optimization (Desautels et al., 2012), resulting in a multiplicative increase in the regret that is independent of the delay and an additive term depending on the maximum delay.
In the rest of the paper we generalize the above results to the partial monitoring setting, extending, unifying, and often improving existing results.
3 Black-Box Algorithms for Delayed Feedback
In this section we provide black-box algorithms for the delayed feedback problem. We assume that there exists a base algorithm Base for solving the prediction problem without delay. We often do not specify the assumptions underlying the regret bounds of these algorithms, and assume that the problem we consider only differs from the original problem because of the delays. For example, in the adversarial setting, Base may build on the assumption that the reward functions are selected in an oblivious or non-oblivious way (i.e., independently or not of the predictions of the forecaster). First we consider the adversarial case in Section 3.1. Then in Section 3.2, we provide tighter bounds for the stochastic case.
3.1 Adversarial setting
We say that a prediction algorithm enjoys a regret or expected regret bound under the given assumptions in the non-delayed setting if (i) is nondecreasing, concave, ; and (ii) or, respectively, for all . The algorithm of Weinberger & Ordentlich (2002) for the adversarial full information setting subsamples the reward sequence by the constant delay , and runs a base algorithm Base on each of the subsampled sequences. Weinberger & Ordentlich (2002) showed that if Base enjoys a regret bound then their algorithm in the fixed delay case enjoys a regret bound . Furthermore, when Base is minimax optimal in the non-delayed setting, the subsampling algorithm is also minimax optimal in the (full information) delayed setting, as can be seen by constructing a reward sequence that changes only in every times. Note that Weinberger & Ordentlich (2002) do not require condition (i) of . However, these conditions imply that is a concave function of for any fixed (a fact which will turn out to be useful in the analysis later), and are satisfied by all regret bounds we are aware of (e.g., for multi-armed bandits, contextual bandits, partial monitoring, etc.), which all have a regret upper bound of the form for some , with, typically, or . (Footnote 1: means that there is a such that .)
In this section we extend the algorithm of Weinberger & Ordentlich (2002) to the case when the delays are not constant, and to the partial monitoring setting. The idea is that we run several instances of a non-delayed algorithm Base as needed: an instance is “free” if it has received the feedback corresponding to its previous prediction – before this we say that the instance is “busy”, waiting for the feedback. When we need to make a prediction, we use one of the existing instances that is free, and is hence ready to make another prediction. If no such instance exists, we create a new one to be used (a new instance is always “free”, as it is not waiting for the feedback of a previous prediction). The resulting algorithm, which we call Black-Box Online Learning under Delayed feedback (BOLD), is shown below (note that when the delays are constant, BOLD reduces to the algorithm of Weinberger & Ordentlich (2002)):
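The following is a minimal Python sketch of the free/busy scheme just described; it is not the authors' original listing. The `predict`/`update` interface of the base algorithm, the class names `Bold` and `CountingBase`, and the driver `run_bold` are assumptions introduced only for illustration. Feedback is passed as a set of (decision time, feedback value) pairs, following the time-stamped feedback sets of Section 2.

```python
class CountingBase:
    """Stand-in for a non-delayed base algorithm (hypothetical interface)."""

    def __init__(self):
        self.predictions = 0
        self.updates = 0

    def predict(self, side_info=None):
        self.predictions += 1
        return 0  # a real base algorithm would choose an action here

    def update(self, feedback):
        self.updates += 1


class Bold:
    """Runs as many Base instances as needed; each one sees no delay."""

    def __init__(self, make_base):
        self.make_base = make_base  # factory producing fresh Base instances
        self.instances = []
        self.free = []              # indices of instances not awaiting feedback
        self.owner = {}             # decision time -> index of the instance used
        self.t = 0

    def predict(self, side_info=None):
        self.t += 1
        if self.free:
            k = self.free.pop()     # reuse any free instance
        else:
            self.instances.append(self.make_base())  # all busy: create one
            k = len(self.instances) - 1
        self.owner[self.t] = k      # instance k is now busy
        return self.instances[k].predict(side_info)

    def receive(self, feedback_set):
        """feedback_set: iterable of (decision_time, feedback) pairs."""
        for s, h in feedback_set:
            k = self.owner.pop(s)
            self.instances[k].update(h)
            self.free.append(k)     # instance k is free again


def run_bold(delays, make_base):
    """Drive BOLD with placeholder feedback arriving at the end of step t + delay."""
    bold = Bold(make_base)
    arrivals = {}
    for t, tau in enumerate(delays, start=1):
        bold.predict()
        arrivals.setdefault(t + tau, []).append((t, None))
        bold.receive(arrivals.pop(t, []))
    return bold


print(len(run_bold([2] * 10, CountingBase).instances))  # 3
print(len(run_bold([0] * 10, CountingBase).instances))  # 1
```

With a constant delay of 2 the sketch cycles through three instances in turn, and with no delay it uses a single instance. Which free instance is reused is an arbitrary choice here (the most recently freed one).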
Clearly, the performance of BOLD depends on how many instances of Base we need to create, and how many times each instance is used. Let denote the number of Base instances created by BOLD up to and including time . That is, , and we create a new instance at the beginning of any time instant when all instances are waiting for their feedback. Let be the total number of outstanding (missing) feedbacks when the forecaster is making a prediction at time instant . Then we have algorithms waiting for their feedback, and so . Since we only introduce new instances when it is necessary (and each time instant at most one new instance is created), it is easy to see that
for any , where .
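The counting argument above can be checked numerically. The sketch below computes, for each time step, the number of outstanding feedbacks and, separately, the number of instances that a scheduler creating an instance only when all existing ones are busy would need. The function names and the convention that feedback for decision s arrives at the end of step s plus its delay are assumptions made for illustration.

```python
def outstanding_counts(delays):
    """For each t, count earlier decisions s < t whose feedback is still missing.

    Feedback for decision s arrives at the end of step s + delays[s - 1], so it
    is missing at prediction time t exactly when s + delays[s - 1] >= t.
    """
    n = len(delays)
    return [
        sum(1 for s in range(1, t) if s + delays[s - 1] >= t)
        for t in range(1, n + 1)
    ]


def instances_needed(delays):
    """Simulate a scheduler that creates an instance only when all are busy."""
    busy_until = []  # per instance: arrival step of its outstanding feedback
    for t, tau in enumerate(delays, start=1):
        for k, until in enumerate(busy_until):
            if until < t:            # feedback already arrived: instance is free
                busy_until[k] = t + tau
                break
        else:                        # no free instance was found
            busy_until.append(t + tau)
    return len(busy_until)


delays = [3, 0, 1, 0, 2, 2]
counts = outstanding_counts(delays)
print(counts)                    # [0, 1, 1, 2, 0, 1]
print(instances_needed(delays))  # 3
```

In this example the scheduler ends up with one more instance than the peak number of outstanding feedbacks, which matches the counting argument above.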
We can use the result above to transfer the regret guarantee of the non-delayed base algorithm Base to a guarantee on the regret of BOLD.
Suppose that the non-delayed algorithm Base used in BOLD enjoys an (expected) regret bound . Assume, furthermore, that the delays are independent of the forecaster’s prediction . Then the expected regret of BOLD after time steps satisfies
As the second inequality follows from the concavity of (), it remains to prove the first one.
For any , let denote the list of time instants in which BOLD has used the prediction chosen by instance , and let be the number of time instants this happens. Furthermore, let denote the regret incurred during the time instants with :
where is the prediction made by BOLD (and instance ) at time instant . By construction, instance does not experience any delays. Hence, is its regret in a non-delayed online learning problem. (Footnote 2: Note that is a function of the delay sequence and is not a function of the predictions . Hence, the reward sequence that instance is evaluated on is chosen obliviously whenever the adversary of BOLD is oblivious.) Then,
Now, using the fact that is an (expected) regret bound, we obtain
where the first inequality follows since is a deterministic function of the delays, while the last inequality follows from Jensen’s inequality and the concavity of . Substituting from (1) and taking the expectation concludes the proof. ∎
Now, we need to bound to make the theorem meaningful. When all delays are the same constant, for we get , and we get back the regret bound
of Weinberger & Ordentlich (2002), thus generalizing their result to partial monitoring. We do not know whether this bound is tight even when Base is minimax optimal, as the argument of Weinberger & Ordentlich (2002) for the lower bound does not work in the partial information setting (the forecaster can gain extra information in each block with the same reward functions).
Assuming the delays are i.i.d., we can give an interesting bound on . The result is based on the fact that although can be as large as , both its expectation and variance are upper bounded by .
Assume is a sequence of i.i.d. random variables with finite expected value, and let . Then
First consider the expectation and the variance of . For any ,
so in the same way as above. By Bernstein’s inequality (Cesa-Bianchi & Lugosi, 2006, Corollary A.3), for any and any
we have, with probability at least,
Applying the union bound for , and our previous bounds on the variance and expectation of , we obtain that with probability at least ,
Taking into account that , we get the statement of the lemma. ∎
Under the conditions of Theorem 1, if the sequence of delays is i.i.d., then
Note that although the delays can be arbitrarily large, whenever the expected value is finite, the bound only increases by a factor.
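As an informal illustration of this remark (not a substitute for the lemma), the sketch below samples i.i.d. delays with unbounded support and finite mean, and compares the peak number of outstanding feedbacks with the largest single delay. The delay distribution (a floored exponential), the mean of 5, the horizon, and the function names are all assumptions chosen for the example.

```python
import heapq
import random


def peak_outstanding(delays):
    """Maximum, over prediction times, of the number of missing feedbacks."""
    pending = []  # min-heap of arrival steps of feedback not yet received
    peak = 0
    for t, tau in enumerate(delays, start=1):
        # Feedback that arrived by the end of step t - 1 is no longer missing.
        while pending and pending[0] < t:
            heapq.heappop(pending)
        peak = max(peak, len(pending))
        heapq.heappush(pending, t + tau)
    return peak


def sample_delays(n, mean_delay, rng):
    """I.i.d. non-negative integer delays with finite mean, unbounded support."""
    return [int(rng.expovariate(1.0 / mean_delay)) for _ in range(n)]


rng = random.Random(0)
delays = sample_delays(10_000, 5.0, rng)
print("peak outstanding:", peak_outstanding(delays))
print("largest delay:   ", max(delays))
```

The peak can never exceed the largest delay in the sample, since only the most recent decisions can still be waiting; how much smaller it is in a given run depends on the assumed distribution.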
3.2 Finite stochastic setting
In this section, we consider the case when the prediction set of the forecaster is finite; without loss of generality we assume . We also assume that there is no side information (that is, is a constant for all , and, hence, will be omitted; the results can be extended easily to the case of a finite side information set, where we can repeat the procedures described below for each value of the side information separately). The main assumption in this section is that the outcomes form an i.i.d. sequence, which is also independent of the predictions of the forecaster. When is finite, this leads to the standard i.i.d. partial monitoring (IPM) setting, while the conventional multi-armed bandit (MAB) setting is recovered when the feedback is the reward of the last prediction, that is, . As in the previous section, we will assume that the feedback delays are independent of the outcomes of the environment. The main result of this section shows that under these assumptions, the penalty in the regret grows in an additive fashion due to the delays, as opposed to the multiplicative penalty that we have seen in the adversarial case.
By the independence assumption on the outcomes, the sequences of potential rewards and feedbacks are i.i.d., respectively, for the same prediction . In this setting we also assume that the feedback and reward sequences of different predictions are independent of each other. Let denote the expected reward of predicting , the optimal reward and with the optimal prediction. Moreover, let denote the number of times is predicted by the end of time instant . Then, defining the “gaps” for all , the expected regret of the forecaster becomes
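For reference, the standard form of this decomposition can be written as follows. The symbols are assumptions introduced here because the original notation is not recoverable from the text: $K$ is the number of predictions, $\mu_i$ the expected reward of prediction $i$, $\mu^*$ the optimal expected reward, $T_i(n)$ the number of times $i$ is predicted by the end of time instant $n$, and $\Delta_i$ the gap of prediction $i$.

```latex
\[
  \mathbb{E}\left[\mathrm{Regret}_n\right]
  \;=\; \sum_{i=1}^{K} \Delta_i \, \mathbb{E}\left[T_i(n)\right],
  \qquad
  \Delta_i = \mu^* - \mu_i,
  \quad
  \mu^* = \max_{1 \le j \le K} \mu_j .
\]
```

Bounding the expected regret therefore reduces to bounding the expected number of times each suboptimal prediction is made.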
Similarly to the adversarial setting, we build on a base algorithm Base for the non-delayed case. The advantage in the IPM setting (and that we consider expected regret) is that here Base can consider a permuted order of rewards and feedbacks, and so we do not have to wait for the actual feedback; it is enough to receive a feedback for the same prediction. This is the idea at the core of our algorithm, Queued Partial Monitoring with Delayed Feedback (QPM-D):
Here we have a Base partial monitoring algorithm for the non-delayed case, which is run inside the algorithm. The feedback information coming from the environment is stored in separate queues for each prediction value. The outer algorithm constantly queries Base: while feedbacks for the predictions made are available in the queues, only the inner algorithm Base runs (that is, this happens within a single time instant in the real prediction problem). When no feedback is available, the outer algorithm keeps sending the same prediction to the real environment until a feedback for that prediction arrives. In this way Base is run in a simulated non-delayed environment. The next lemma implies that the inner algorithm Base actually runs in a non-delayed version of the problem, as it experiences the same distributions:
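The queueing scheme described above can be sketched in Python as follows; this is a hedged reconstruction from the prose, not the authors' listing. The base-algorithm interface (`predict()` and `update(action, feedback)`), the toy `RoundRobinBase`, and the driver `run_qpmd` are hypothetical names used only for illustration.

```python
from collections import deque


class RoundRobinBase:
    """Toy non-delayed base algorithm: cycles through the actions."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.updates = 0

    def predict(self):
        return self.updates % self.num_actions

    def update(self, action, feedback):
        self.updates += 1


class QPMD:
    def __init__(self, base, num_actions):
        self.base = base
        self.queues = [deque() for _ in range(num_actions)]  # one FIFO per action
        self.sent = {}                 # decision time -> action sent to the environment
        self.t = 0
        self.current = base.predict()  # action Base is currently asking for

    def predict(self):
        # Keep sending Base's current action until feedback for it is available.
        self.t += 1
        self.sent[self.t] = self.current
        return self.current

    def receive(self, feedback_set):
        """feedback_set: iterable of (decision_time, feedback) pairs."""
        for s, h in feedback_set:
            self.queues[self.sent.pop(s)].append(h)
        # Run Base in its simulated non-delayed environment while the queue of
        # the action it asks for is non-empty.
        while self.queues[self.current]:
            self.base.update(self.current, self.queues[self.current].popleft())
            self.current = self.base.predict()


def run_qpmd(agent, delays, feedback_fn):
    """Feedback for the decision at time t arrives at the end of step t + delay."""
    arrivals, actions = {}, []
    for t, tau in enumerate(delays, start=1):
        a = agent.predict()
        actions.append(a)
        arrivals.setdefault(t + tau, []).append((t, feedback_fn(a)))
        agent.receive(arrivals.pop(t, []))
    return actions


base = RoundRobinBase(2)
actions = run_qpmd(QPMD(base, 2), [1] * 6, lambda a: 1.0)
print(actions, base.updates)  # [0, 0, 1, 1, 1, 0] 4
```

In this run the outer algorithm repeats action 0 and then action 1 while waiting, and feedback that arrives for an action Base is not currently requesting is kept in its queue and consumed later, so Base takes fewer steps (4) than the outer algorithm (6).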
Consider a delayed stochastic IPM problem as defined above. For any prediction , for any let denote the feedback QPM-D receives for predicting . Then the sequence is an i.i.d. sequence with the same distribution as the sequence of feedbacks for prediction .
To relate the non-delayed performance of Base and the regret of QPM-D, we need a few definitions. For any , let denote the number of feedbacks for prediction that are received by the end of time instant . Then the number of missing feedbacks for when making a prediction at time instant is . Let . Furthermore, for each , let be the number of times algorithm Base has predicted while being queried times. Let denote the number of steps the inner algorithm Base makes in steps of the real IPM problem. Next we relate and , as well as the number of times QPM-D and Base (in its simulated environment) make a specific prediction.
Suppose QPM-D is run for time instants, and has queried Base times. Then and
Since Base can take at most one step for each feedback that arrives, and QPM-D has to make at least one step for each arriving feedback, .
Now, fix a prediction . If Base, and hence, QPM-D, has not predicted by time instant , (3) trivially holds. Otherwise, let denote the last time instant (up to time ) when QPM-D predicts . Then . Suppose Base has been queried times by time instant (inclusive). At this time instant, the buffer must be empty and Base must be predicting , otherwise QPM-D would not predict in the real environment. This means that all the feedbacks that have arrived before this time instant have been fed to the base algorithm, which has also made an extra step, that is, . Therefore,
We can now give an upper bound on the expected regret of Algorithm 2.
Suppose the non-delayed Base algorithm is used in QPM-D in a delayed stochastic IPM environment. Then the expected regret of QPM-D is upper-bounded by
where is the expected regret of Base when run in the same environment without delays.
When the delay is bounded by for all , we also have , and . When the sequence of delays for each prediction is i.i.d. with a finite expected value but unbounded support, we can use Lemma 2 to bound , and obtain a bound .
Assume that QPM-D is run longer so that Base is queried for times (i.e., it is queried more times). Then, since , the number of times is predicted by the base algorithm, namely , can only increase, that is, . Combining this with the expectation of (3) gives
which in turn gives,
As shown in Lemma 4, the reordered rewards and feedbacks are i.i.d. with the same distribution as the original feedback sequence . The base algorithm Base has worked on the first of these feedbacks for each (in its extended run), and has therefore operated for steps in a simulated environment with the same reward and feedback distributions, but without delay. Hence, the first summation in the right hand side of (5) is in fact , the expected regret of the base algorithm in a non-delayed environment. This concludes the proof. ∎
4 UCB for the Multi-Armed Bandit Problem with Delayed Feedback
While the algorithms in the previous section provide an easy way to convert algorithms devised for the non-delayed case to ones that can handle delays in the feedback, improvements can be achieved if one makes modifications inside the existing non-delayed algorithms while retaining their theoretical guarantees. This can be viewed as a “white-box” approach to extending online learning algorithms to the delayed setting, and enables us to escape the high memory requirements of black-box algorithms that arise for both of our methods in the previous section when the delays are large. We consider the stochastic multi-armed bandit problem, and extend the UCB family of algorithms (Auer et al., 2002; Garivier & Cappé, 2011) to the delayed setting. The modification proposed is quite natural, and the common characteristics of UCB-type algorithms enable a unified way of extending their performance guarantees to the delayed setting (up to an additive penalty due to delays).
Recall that in the stochastic MAB setting, which is a special case of the stochastic IPM problem of Section 3.2, the feedback at time instant is , and there is a distribution from which the rewards of each prediction are drawn in an i.i.d. manner. Here we assume that the rewards of different predictions are independent of each other. We use the same notation as in Section 3.2.
Several algorithms devised for the non-delayed stochastic MAB problem are based on upper confidence bounds (UCBs), which are optimistic estimates of the expected reward of different predictions. Different UCB-type algorithms use different upper confidence bounds, and choose, at each time instant, a prediction with the largest UCB. Let denote the UCB for prediction at time instant , where is the number of reward samples used in computing the estimate. In a non-delayed setting, the prediction of a UCB-type algorithm at time instant is given by In the presence of delays, one can simply use the same upper confidence bounds only with the rewards that are observed, and predict
at time instant (recall that is the number of rewards that can be observed for prediction before time instant ). Note that if the delays are zero, this algorithm reduces to the corresponding non-delayed version of the algorithm.
The algorithms defined by (6) can easily be shown to enjoy the same regret guarantees as their non-delayed versions, up to an additive penalty depending on the delays. This is because the analyses of the regrets of UCB algorithms follow the same pattern of upper bounding the number of trials of a suboptimal prediction using concentration inequalities suitable for the specific form of UCBs they use.
As an example, the UCB1 algorithm (Auer et al., 2002) uses UCBs of the form , where is the average of the first observed rewards. Using this UCB in our decision rule (6), we can bound the regret of the resulting algorithm (called Delayed-UCB1) in the delayed setting:
For any , the expected regret of the Delayed-UCB1 algorithm is bounded by
Note that the last term in the bound is the additive penalty, and, under different assumptions, it can be bounded in the same way as after Theorem 6. The proof of this theorem, as well as a similar regret bound for the delayed version of the KL-UCB algorithm (Garivier & Cappé, 2011) can be found in Appendix B.
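A minimal Python sketch of a delayed UCB1-style rule is given below. It assumes the standard UCB1 index of Auer et al. (2002), that is, the empirical mean of the observed rewards plus the square root of 2 log t divided by the number of observed rewards, computed only from feedback that has already arrived. Treating an arm with no observed reward as having an infinite index, breaking ties by the lowest arm number, and the names `DelayedUCB1` and `simulate` are assumptions made for this illustration.

```python
import math


class DelayedUCB1:
    def __init__(self, num_arms):
        self.sums = [0.0] * num_arms     # sum of observed rewards per arm
        self.observed = [0] * num_arms   # number of observed rewards per arm
        self.pulled = {}                 # decision time -> arm awaiting feedback
        self.t = 0

    def _index(self, arm):
        s = self.observed[arm]
        if s == 0:
            return math.inf              # no observed reward yet: explore first
        mean = self.sums[arm] / s
        return mean + math.sqrt(2.0 * math.log(self.t) / s)

    def predict(self):
        self.t += 1
        arm = max(range(len(self.sums)), key=self._index)
        self.pulled[self.t] = arm
        return arm

    def receive(self, feedback_set):
        """feedback_set: iterable of (decision_time, reward) pairs."""
        for s, reward in feedback_set:
            arm = self.pulled.pop(s)
            self.sums[arm] += reward
            self.observed[arm] += 1


def simulate(agent, delays, reward_of):
    """Reward for the decision at time t is revealed at the end of step t + delay."""
    arrivals, actions = {}, []
    for t, tau in enumerate(delays, start=1):
        arm = agent.predict()
        actions.append(arm)
        arrivals.setdefault(t + tau, []).append((t, reward_of(arm)))
        agent.receive(arrivals.pop(t, []))
    return actions


# Two arms with deterministic rewards 0 and 1, constant delay of 2 steps.
actions = simulate(DelayedUCB1(2), [2] * 200, lambda arm: float(arm))
print(actions[:4])                         # [0, 0, 0, 1]
print(actions.count(0), actions.count(1))
```

With zero delays the rule coincides with ordinary UCB1 under the same tie-breaking. With a delay of 2 the first arm is repeated three times before any reward is observed, which is the kind of extra cost captured by the additive penalty term.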
5 Conclusion and future work
We analyzed the effect of feedback delays in online learning problems. We examined the partial monitoring case (which also covers the full information and the bandit settings), and provided general algorithms that transform forecasters devised for the non-delayed case into ones that handle delayed feedback. It turns out that the price of delay is a multiplicative increase in the regret in adversarial problems, and only an additive increase in stochastic problems. While we believe that these findings are qualitatively correct, we do not have lower bounds to prove this (matching lower bounds are available for the full information case only).
It also turns out that the most important quantity that determines the performance of our algorithms is , the maximum number of missing rewards. It is interesting to note that is the maximum number of servers used in a multi-server queuing system with infinitely many servers and deterministic arrival times. It is also the maximum deviation of a certain type of Markov chain. While we have not found any immediately applicable results in these fields, we think that applying techniques from these areas could lead to an improved understanding of, and hence an improved analysis of, online learning under delayed feedback.
This work was supported by the Alberta Innovates Technology Futures and NSERC.
- Agarwal & Duchi (2011) Agarwal, Alekh and Duchi, John. Distributed delayed stochastic optimization. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24 (NIPS), pp. 873–881, 2011.
- Auer et al. (2002) Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, May 2002.
- Cesa-Bianchi & Lugosi (2006) Cesa-Bianchi, Nicolò and Lugosi, Gábor. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006. ISBN 0521841089.
- Desautels et al. (2012) Desautels, Thomas, Krause, Andreas, and Burdick, Joel. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, UK, 2012. Omnipress.
- Doob (1953) Doob, Joseph L. Stochastic Processes. John Wiley & Sons, 1953.
- Dudik et al. (2011) Dudik, Miroslav, Hsu, Daniel, Kale, Satyen, Karampatziakis, Nikos, Langford, John, Reyzin, Lev, and Zhang, Tong. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 169–178, Corvallis, Oregon, 2011. AUAI Press.
- Garivier & Cappé (2011) Garivier, Aurélien and Cappé, Olivier. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), volume 19, pp. 359–376, Budapest, Hungary, July 2011.
- Hoeffding (1963) Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- Langford et al. (2009) Langford, John, Smola, Alexander, and Zinkevich, Martin. Slow learners are fast. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 2331–2339. 2009.
- Li et al. (2010) Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pp. 661–670, New York, NY, USA, 2010. ACM.
- Mesterharm (2005) Mesterharm, Chris J. On-line learning with delayed label feedback. In Jain, Sanjay, Simon, HansUlrich, and Tomita, Etsuji (eds.), Algorithmic Learning Theory, volume 3734 of Lecture Notes in Computer Science, pp. 399–413. Springer Berlin Heidelberg, 2005.
- Mesterharm (2007) Mesterharm, Chris J. Improving on-line learning. PhD thesis, Department of Computer Science, Rutgers University, New Brunswick, NJ, 2007.
- Neu et al. (2010) Neu, Gergely, György, András, Szepesvári, Csaba, and Antos. Online Markov decision processes under bandit feedback. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23 (NIPS), pp. 1804–1812, 2010.
- Titchmarsh & Heath-Brown (1987) Titchmarsh, Edward Charles and Heath-Brown, David Rodney. The Theory of the Riemann Zeta-Functions. Oxford University Press, second edition, January 1987.
- Weinberger & Ordentlich (2002) Weinberger, Marcelo J. and Ordentlich, Erik. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959–1976, September 2002.
Appendix A Proof of Lemma 4
In this appendix we prove Lemma 4 that was used in the i.i.d. partial monitoring setting (Section 3.2). To that end, we will first need two other lemmas. The first lemma shows that the i.i.d. property of a sequence of random variables is preserved under an independent random reordering of that sequence.
Let , be a sequence of independent, identically distributed random variables. If we reorder this sequence according to an independent random permutation, then the resulting sequence is i.i.d. with the same distribution as .
Let the reordered sequence be denoted by . It is sufficient to show that for all , for all , we have
Since is i.i.d., for any fixed permutation the equation above holds as both sides are equal to . Since the permutations are independent of the sequence , using the law of total probability this extends to the general case as well. ∎
We also need the following result (Doob, 1953, Page 145, Chapter III, Theorem 5.2).
Let be a sequence of i.i.d. random variables, and be a subsequence of it such that the decision whether to include in the subsequence is independent of future values in the sequence, i.e., of for . Then the sequence is an i.i.d. sequence with the same distribution as .
We can now proceed to the proof of Lemma 4.
Proof of Lemma 4.
Let be the sequence resulting from sorting the variables by their possible observation times (that is, is the earliest feedback that can be observed if is predicted at the appropriate time, and so on). Since delays are independent of the outcomes, they define an independent reordering on the sequence of feedbacks. Hence, by Lemma 8, is an i.i.d. sequence with the same distribution as . Note that , the sequence of feedbacks (sorted by their observation times) that the agent observes for predicting , is a subsequence of where the decision whether to include each in the subsequence cannot depend on future possible observations . Also, the feedbacks of other predictions that are used in this decision were assumed to be independent of . Hence, by Lemma 9, is an i.i.d. sequence with the same distribution as , which in turn has the same distribution as . ∎
Appendix B UCB for the Multi-Armed Bandit Problem with Delayed Feedback
This appendix details the framework we described in Section 4 for analyzing UCB-type algorithms in the delayed settings, and provides the missing proofs. The regret of a UCB algorithm is usually analyzed by upper bounding the (expected) number of times a suboptimal prediction is made, and then using Equation (2) to get an expected regret bound. Consider a UCB algorithm with upper confidence bounds , and fix a suboptimal prediction . The typical analysis (e.g., by Auer et al. (2002)) considers the case when this prediction is made for at least times (for a large enough ), and uses concentration inequalities suitable for the specific form of the upper-confidence bound to show that it is unlikely to make this suboptimal prediction more than times because observing samples from its reward distribution suffices to distinguish it from the optimal prediction with high confidence. This value thus gives an upper bound on the expected number of times is predicted. Examples of such concentration inequalities include Hoeffding’s inequality (Hoeffding, 1963) and Theorem 10 of Garivier & Cappé (2011), which are used for the UCB1 and KL-UCB algorithms, respectively.
More precisely, the general analysis of UCB-type algorithms in the non-delayed setting works as follows: for , we have , where the sum on the right hand side captures how much larger than the value of is (recall that is the number of times is predicted up to and including time ). Whenever is predicted, its UCB, , must have been greater than that of an optimal prediction, , which implies
The expected value of the summation on the right-hand-side is then bounded using concentration inequalities as mentioned above.
In the delayed-feedback setting, if we use upper confidence bounds instead (where was defined to be the number of rewards observed up to and including time instant ), in the same way as above we can write
Since , with we get
Now the same concentration inequalities used to bound (7) in the analysis of the non-delayed setting can be used to upper bound the expected value of the sum in (8). Putting this into (2), we see that one can reuse the same upper confidence bound in the delayed setting (with only the observed rewards) and get a performance similar to the non-delayed setting, with only an additive penalty that depends on the delays. The following two sections demonstrate the use of this method on two UCB-type algorithms.
B.1 UCB1 under delayed feedback: Proof of Theorem 7
Proof of Theorem 7.
Following the outline of the previous section, we can bound the summation in (8) using the same analysis as in the original UCB1 paper (Auer et al., 2002). In particular, for any prediction we can write
The event in the second summation implies that either or (otherwise we will have ). Hence,
Choosing makes the events in the last summation above impossible, because . Therefore, combining with (8), we can write
Taking expectation gives
As in the original analysis, Hoeffding’s inequality (Hoeffding, 1963) can be used to bound each of the probabilities in the summation, to get