1 Introduction
Multi-Armed Bandits (MAB) have been a well-studied problem in machine learning theory for capturing the exploration-exploitation trade-off in online decision making. MAB has applications in domains like e-commerce, computational advertising, clinical trials, recommendation systems, etc.
In most real-world applications, assumptions of the original theoretical MAB model, such as immediate rewards, non-stochasticity of the rewards, etc., do not hold. A more natural setting is one where the rewards of pulling bandit arms are delayed into the future, since the effects of actions are not always observed immediately. Pike-Burke et al. (2018) first explored this setting, assuming stochastic rewards for pulling an arm that are obtained at some specific time step in the future. This setting is called delayed, aggregated, anonymous feedback (DAAF). The complexity of this problem stems from the anonymity of the feedback: the model cannot distinguish which of the previous time steps' pulls generated the rewards obtained at a particular time.
This work was extended by Cesa-Bianchi et al. (2018), who relaxed the temporal specificity of observing the reward at one specific time in the future. The reward for pulling an arm can now be spread adversarially over multiple time steps in the future. However, they made an added assumption of non-stochasticity of the rewards from each arm, so that the same total reward is observed each time the same arm is pulled. This scenario of non-stochastic composite anonymous feedback (CAF) can be applied to several applications, but it still does not cover the entire spectrum of applications.
Consider a clinical trial where the benefits of different medicines on improving patient health are observed. CAF offers a more natural extension to this scenario than DAAF, since the benefits from a medicine can be spread over multiple steps after taking it, rather than arriving all at once at a single time step in the future. However, the effects of the medicine might differ from patient to patient, and thus assuming the same total health improvement for each use of a specific medicine is not very realistic. Inspired by this real-world setting, we suggest that a more general bandit setting is CAF with the non-stochasticity assumption dropped. We study such a MAB setting with stochastic delayed composite anonymous feedback (SDCAF).
In our model, a player interacts with an environment of $K$ actions (or arms) in a sequential fashion. At each time step the player selects an action, which leads to a reward generated at random from an underlying reward distribution and spread over a bounded number of time steps after pulling the arm. More precisely, we assume that the reward for choosing an action at a given time is adversarially spread over at most $d$ consecutive time steps in the future. At the end of each round, the player observes only the sum of all the rewards that arrive in that round. Crucially, the player does not know which of the past plays have contributed to this aggregated reward. Extending algorithms from the theoretical model of SDCAF to practical applications involves obtaining guarantees on the rewards they obtain. The goal is to maximize the cumulative reward from plays of the bandit, or equivalently to minimize the regret (the total difference between the reward of the optimal action and that of the actions taken).
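As a concrete illustration of this regret notion, a small helper (the function name and signature are ours, not from the paper) computes the pseudo-regret of a play sequence against the best arm mean:

```python
def regret(means, plays):
    """Cumulative pseudo-regret of a sequence of arm plays: the gap
    between the best arm's mean and the played arm's mean, summed
    over all rounds."""
    best = max(means)
    return sum(best - means[arm] for arm in plays)
```

Note this is the expectation-level (pseudo-)regret used in the analysis, not the realized difference of observed rewards, which in the SDCAF model is further obscured by the delayed, anonymous aggregation.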
We present two algorithms for this setting, both of which run a modified version of the UCB algorithm (Auer et al., 2002) in phases where the same arm is pulled multiple times within a phase. This is motivated by the aim of reducing the error in approximating the mean reward of an arm due to extra and missing reward components from adjacent arm pulls. We prove sublinear regret bounds for both algorithms.
1.1 Related Work
Online learning with delayed feedback has been studied in the non-bandit setting by Weinberger and Ordentlich (2006); Mesterharm (2005); Langford et al. (2009); Joulani et al. (2013); Quanrud and Khashabi (2015); Joulani et al. (2016); Garrabrant et al. (2016) and in the bandit setting by Neu et al. (2010); Joulani et al. (2013); Mandel et al. (2015); Cesa-Bianchi et al. (2016); Vernade et al. (2017); Pike-Burke et al. (2018). Dudík et al. (2011) consider stochastic contextual bandits with a constant delay, and Desautels et al. (2014) consider Gaussian process bandits with a bounded stochastic delay. The general observation that delay causes an additive regret penalty in stochastic bandits and a multiplicative one in adversarial bandits is made in Joulani et al. (2013). The delayed composite loss function of our setting generalizes the composite loss function setting of Dekel et al. (2014).

2 Problem Definition
There are $K$ actions or arms in the set $\{1, \dots, K\}$. Each action $i$ is associated with a reward distribution $\nu_i$ supported in $[0, 1]$, with mean $\mu_i$. Let $X_{i,t}$ be a random variable which stands for the total reward obtained on pulling arm $i$ at time $t$. $X_{i,t}$ is drawn from the distribution $\nu_i$, and may be spread over a maximum of $d$ time steps in an adversarial manner. $X_{i,t}$ is defined as the sum of components $X_{i,t}^{(s)}$ for $s \in \{0, 1, \dots, d-1\}$, where $X_{i,t}^{(s)}$ denotes the reward obtained at time $t+s$ on pulling arm $i$ at time $t$. Let $i_t$ denote the action chosen by the player at the beginning of round $t$. If $i_t = i$, then the player obtains reward component $X_{i,t}^{(0)}$ at time $t$, $X_{i,t}^{(1)}$ at time $t+1$, and so on until time $t+d-1$. The reward that the player observes at time $t$ is the combined reward $\sum_{s=0}^{d-1} X_{i_{t-s},\,t-s}^{(s)}$, which is the sum of past reward contributions, where $X_{i,t}^{(s)} = 0$ for all $s \ge d$ and when $t < 1$.
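To make the feedback model concrete, here is a minimal simulation sketch. The class name and the uniform split are our illustration only: the model itself lets an adversary spread each pull's reward arbitrarily over the next $d$ steps, and the player sees only the per-step aggregate.

```python
import numpy as np

class SDCAFEnvironment:
    """Minimal simulator of stochastic delayed composite anonymous
    feedback. For simplicity each pull's reward is split uniformly
    over the next d steps; in the general model the split may be
    adversarial."""

    def __init__(self, means, d, seed=0):
        self.means = list(means)           # arm means, rewards in [0, 1]
        self.d = d                         # maximum delay spread
        self.rng = np.random.default_rng(seed)
        self.pending = np.zeros(d)         # reward mass scheduled for steps t, t+1, ...

    def pull(self, arm):
        # Total (unobservable) reward of this pull, drawn stochastically.
        reward = self.rng.binomial(1, self.means[arm])
        # Spread it over the next d steps (uniform split, for illustration).
        self.pending += reward / self.d
        # The player observes only the anonymous aggregate arriving now.
        observed = self.pending[0]
        self.pending = np.append(self.pending[1:], 0.0)
        return float(observed)
```

Pulling the same arm repeatedly shows the characteristic ramp-up of the observed aggregate, and a pull of a different arm still receives leftovers from earlier pulls, which is exactly the anonymity the analysis must handle.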
3 Algorithms
We present two algorithms for this setting of SDCAF in Algorithm 1 and Algorithm 2 respectively. For Algorithm 2 we only specify the additional inputs and initialization over Algorithm 1. We first provide the intuition behind the algorithms and then provide a formal regret analysis.
Algorithm 1 is a modified version of the standard UCB algorithm, run in phases in which the same arm is pulled multiple times while an upper confidence bound on the reward of each arm is maintained. More specifically, each phase consists of two steps. In Step 1, the arm with the maximum upper confidence bound is selected. In Step 2, the selected arm is pulled $m$ times repeatedly. We track all time steps at which arm $i$ is played up to phase $j$ in the set $S_{i,j}$. The rewards obtained are used to update the running estimate $\hat{\mu}_i$ of the arm mean. The intuition behind running the algorithm in phases is to gather sufficient rewards from a single arm so as to have a good estimate of its mean reward. This helps us bound the error in our reward estimate due to extra rewards from the previous phase and missing rewards which seep into the next phase due to delay. In every phase of the algorithm, the selected arm is pulled a fixed number of times $m$; our regret analysis shows that an appropriate choice of $m$ achieves sublinear regret.

Algorithm 2 is a modified version of the improved UCB algorithm from Auer and Ortner (2010), run in phases in which a set of active arms is maintained and pruned based on the arm mean estimates. Each phase consists of two steps. In Step 1, each active arm is played repeatedly for $n_j$ steps. We track all time steps at which arm $i$ was played in the first $j$ phases in the set $S_{i,j}$. In Step 2, a new estimate $\hat{\mu}_i$ of the arm mean reward is calculated as the average of the observations from the time steps in $S_{i,j}$, and arm $i$ is eliminated if this estimate falls below that of the best active arm by more than the current confidence width. The number of pulls $n_j$ of each arm is chosen such that the confidence bounds on the estimation error of $\hat{\mu}_i$ hold with a given probability. This algorithm is adapted from Pike-Burke et al. (2018), but we remove the bridge period from the original algorithm, as doing so does not impair the validity of the confidence bounds in our analysis. We now provide a regret analysis for the algorithms and specify the choice of the parameters $m$ and $n_j$.
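To illustrate, the two-step phase structure of Algorithm 1 might be sketched as below. This is a simplified sketch, not the paper's exact specification: the confidence radius is the textbook $\sqrt{2\log T / n}$ rather than the paper's choice, and `env.pull` is assumed to return the aggregated observation of the current time step.

```python
import math
import numpy as np

def phased_ucb(env, K, T, m):
    """Phased UCB sketch: repeatedly (Step 1) select the arm with the
    highest upper confidence bound, then (Step 2) pull it m times in a
    row, attributing every aggregated observation in the phase to that
    arm. Returns the empirical mean estimates."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    t = 0
    while t + m <= T:
        n = np.maximum(counts, 1)
        # Step 1: UCB index; unplayed arms get +inf so they are tried first.
        ucb = np.where(counts > 0,
                       sums / n + np.sqrt(2 * math.log(T) / n),
                       np.inf)
        arm = int(np.argmax(ucb))
        # Step 2: pull the chosen arm m times; the delayed, anonymous
        # observations of these rounds are credited to this arm.
        for _ in range(m):
            sums[arm] += env.pull(arm)
            counts[arm] += 1
            t += 1
    return sums / np.maximum(counts, 1)
```

Pulling one arm for a whole phase means most of each pull's delayed reward still lands inside the phase, so only the mass leaking across phase boundaries corrupts the estimate.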
3.1 Regret Analysis for Algorithm 1
The regret analysis closely follows that of the UCB algorithm as described in Lattimore and Szepesvári (2018). Without loss of generality we assume that the first arm is optimal, so that $\mu_1 = \max_i \mu_i$, and define $\Delta_i = \mu_1 - \mu_i$. We assume that the algorithm runs for $p$ phases. Let $T_i(p)$ denote the number of times arm $i$ is played up to phase $p$. We bound $\mathbb{E}[T_i(p)]$ for each suboptimal arm $i$. For this we show that the following good event $G_i$ holds with high probability.

Here, $G_i$ is the event that $\mu_1$ is never underestimated by the upper confidence bound of the first arm, while at the same time the upper confidence bound for the mean of arm $i$, after $u_i$ observations are taken from this arm, is below the payoff of the optimal arm. We claim that if $G_i$ occurs, then $T_i(p) \le u_i$. Since we always have $T_i(p) \le mp$, it follows that $\mathbb{E}[T_i(p)] \le u_i + mp \cdot \mathbb{P}(G_i^c)$.

Next we bound the probability of occurrence of the complement event $G_i^c$.
Lemma 1.
If $\tilde{\mu}_{i,j}$ is an unbiased estimator of $\mu_i$ built from the true rewards of arm $i$ up to phase $j$, then the error between $\tilde{\mu}_{i,j}$ and the estimate $\hat{\mu}_{i,j}$ computed from the observed rewards can be bounded, where $S_{i,j}$ is the set of time steps at which arm $i$ was played.

The proof of Lemma 1 follows from the fact that in each phase the missing and extra reward components can be paired up, and the maximum difference that we can obtain between them is at most one. We use Lemma 1 to bound $\mathbb{P}(G_i^c)$, which gives us an upper bound on the number of times a suboptimal arm is played.
Theorem 1.
For the stated choice of $m$, the regret of Algorithm 1 is sublinear in $T$.
3.2 Regret Analysis for Algorithm 2
The regret analysis for this algorithm is adapted from Appendix F of Pike-Burke et al. (2018). We first present a lemma to bound the difference between estimators for the arm mean reward.
Lemma 2.
If $\tilde{\mu}_{i,j}$ is an unbiased estimator for $\mu_i$ in phase $j$, then we can bound the difference between $\tilde{\mu}_{i,j}$ and the estimator $\hat{\mu}_{i,j}$ used in Algorithm 2, where each arm is pulled $n_j$ times till phase $j$ and $S_{i,j}$ is the set of time steps at which arm $i$ was played.
The proof proceeds with a similar argument as for Lemma 1. Details can be found in Appendix B.
Choice of $n_j$: We use Algorithm 2 with $n_j$ chosen using some large constants; the exact expression is given in Appendix B. We use Lemma 2 to bound the probability that a suboptimal arm $i$ still remains in the set of active arms.
Lemma 3.
For the above choice of $n_j$, the estimates satisfy the following property: for every arm $i$ and phase $j$, with high probability, either arm $i$ has been eliminated or its mean estimate lies within the stated confidence interval.
Theorem 2.
For the stated choice of $n_j$, the regret of Algorithm 2 is sublinear in $T$.
The proof of Theorem 2 closely follows the analysis of improved UCB from Auer and Ortner (2010), using Lemma 3. Each suboptimal arm $i$ is eliminated in some phase, which contributes a bounded regret term. We use our choice of $n_j$ and sum over all suboptimal arms to get the stated sublinear regret bound. We refer the reader to Appendix B for the detailed regret analysis of Algorithm 2 and the proofs of Lemmas 2 and 3 and Theorem 2.
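The elimination structure of Algorithm 2 can be sketched as follows. This is our simplified sketch, not the paper's algorithm: it uses the classical halving schedule $\tilde{\Delta}_j = 2^{-j}$ of improved UCB, whereas the paper's $n_j$ additionally accounts for the delay $d$ and uses different constants.

```python
import math

def elimination_sketch(pull, K, T):
    """Arm-elimination sketch in the spirit of improved UCB (Auer and
    Ortner, 2010): in phase j every active arm is pulled n_j times, then
    arms whose empirical mean falls more than the current gap guess
    below the best active arm are dropped."""
    active = list(range(K))
    sums = [0.0] * K
    counts = [0] * K
    t, j = 0, 0
    while t < T and len(active) > 1:
        delta = 2.0 ** (-j)                  # current gap guess
        n_j = math.ceil(2 * math.log(max(T * delta * delta, 2.0)) / delta ** 2)
        for arm in active:
            for _ in range(n_j):
                if t >= T:
                    break
                sums[arm] += pull(arm)
                counts[arm] += 1
                t += 1
        if any(counts[a] == 0 for a in active):
            break                            # horizon exhausted mid-phase
        means = {a: sums[a] / counts[a] for a in active}
        best = max(means.values())
        # Eliminate arms whose estimate lags the best by more than delta.
        active = [a for a in active if means[a] >= best - delta]
        j += 1
    return active
```

Because every surviving arm is pulled in long deterministic blocks, the delayed anonymous rewards mostly stay attributed to the correct arm, which is what lets the confidence bounds hold without the bridge period of Pike-Burke et al. (2018).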
4 Conclusion and Future Work
We have studied an extension of the multi-armed bandit problem to stochastic bandits with delayed, composite, anonymous feedback. This setting is considerably harder since the rewards are stochastically generated and spread in an adversarial fashion over future time steps, which makes it difficult to identify the optimal arm. We show that, surprisingly, it is possible to develop a simple phase-based extension of the standard UCB algorithm that performs comparably to algorithms for the simpler delayed feedback setting, where the assignment of rewards to arm plays is known. We suggest two possible directions for extending our work: first, handling the case where the delay parameter $d$ is not perfectly known, and second, extending the setting to contextual bandits.
References
P. Auer, N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
P. Auer and R. Ortner (2010). UCB revisited: improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61 (1), pp. 55–65.
N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora (2016). Delay and cooperation in nonstochastic bandits. In Proceedings of the 29th Conference on Learning Theory (COLT 2016), New York, USA, pp. 605–622.
N. Cesa-Bianchi, C. Gentile, and Y. Mansour (2018). Nonstochastic bandits with composite anonymous feedback. In Proceedings of the 31st Conference on Learning Theory, PMLR 75, pp. 750–773.
O. Dekel et al. (2014). Online learning with composite loss functions. CoRR abs/1405.4471.
T. Desautels, A. Krause, and J. W. Burdick (2014). Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research 15, pp. 4053–4103.
M. Dudík, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang (2011). Efficient optimal learning for contextual bandits. CoRR abs/1106.2369.
S. Garrabrant et al. (2016). Asymptotic convergence in online learning with unbounded delays. CoRR abs/1604.05280.
P. Joulani, A. György, and C. Szepesvári (2013). Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, PMLR 28, pp. 1453–1461.
P. Joulani, A. György, and C. Szepesvári (2016). Delay-tolerant online convex optimization: unified analysis and adaptive-gradient algorithms. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI'16), pp. 1744–1750.
J. Langford, A. Smola, and M. Zinkevich (2009). Slow learners are fast. In Advances in Neural Information Processing Systems 22 (NIPS'09), pp. 2331–2339.
T. Lattimore and C. Szepesvári (2018). Bandit Algorithms. Cambridge University Press.
T. Mandel, Y. Liu, E. Brunskill, and Z. Popović (2015). The queue method: handling delay, heuristics, prior data, and evaluation in bandits. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2849–2856.
C. Mesterharm (2005). Online learning with delayed label feedback. In Algorithmic Learning Theory (ALT).
G. Neu, A. György, C. Szepesvári, and A. Antos (2010). Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems 23, pp. 1804–1812.
C. Pike-Burke, S. Agrawal, C. Szepesvári, and S. Grünewälder (2018). Bandits with delayed, aggregated anonymous feedback. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 4105–4113.
K. Quanrud and D. Khashabi (2015). Online learning with adversarial delays. In Advances in Neural Information Processing Systems 28, pp. 1270–1278.
C. Vernade, O. Cappé, and V. Perchet (2017). Stochastic bandit models for delayed conversions. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI 2017).
M. J. Weinberger and E. Ordentlich (2006). On delayed prediction of individual sequences. IEEE Transactions on Information Theory 48 (7), pp. 1959–1976.
Appendix A Regret Analysis for Algorithm 1
Let $\mu_1, \dots, \mu_K$ represent the means of the reward distributions $\nu_1, \dots, \nu_K$. Without loss of generality we assume that the first arm is optimal, so that $\mu_1 = \max_i \mu_i$. We define $\Delta_i = \mu_1 - \mu_i$. Algorithm 1 runs in phases of pulling the same arm for $m$ time steps, and thus the regret over $p$ phases can be written as
$\mathbb{E}[R_{mp}] = \sum_{i=1}^{K} \Delta_i \,\mathbb{E}[T_i(p)]$  (1)
where $T_i(p)$ denotes the number of times arm $i$ was played in $p$ phases. We bound $\mathbb{E}[T_i(p)]$ for each suboptimal arm $i$. Let $G_i$ be a good event for each arm $i$, defined as follows,
where $u_i$ is a constant to be chosen later. So $G_i$ is the event that $\mu_1$ is never underestimated by the upper confidence bound of the first arm, while at the same time the upper confidence bound for the mean of arm $i$, after $u_i$ observations are taken from this arm, is below the payoff of the optimal arm. Two things are shown:

If $G_i$ occurs, then $T_i(p) \le u_i$.

The complement event $G_i^c$ occurs with low probability.
Since we always have $T_i(p) \le mp$, the following holds
$\mathbb{E}[T_i(p)] \le u_i + mp \cdot \mathbb{P}(G_i^c)$  (2)
Next, we assume that $G_i$ holds and show that $T_i(p) \le u_i$. Suppose $T_i(p) > u_i$. Then arm $i$ was played more than $u_i$ times over the $p$ phases, so there must exist a phase in which arm $i$ was selected even though it had already been played at least $u_i$ times. But using the definition of $G_i$, in that phase the upper confidence bound of arm $i$ is below $\mu_1$, which is itself at most the upper confidence bound of the first arm. Hence arm $i$ could not have been selected in that phase, which is a contradiction. Therefore if $G_i$ occurs, then $T_i(p) \le u_i$.
Now we bound $\mathbb{P}(G_i^c)$. The event $G_i^c$ is as follows:
(3) 
Proof.
Consider the following estimator for the mean of the rewards generated from arm $i$ till phase $j$: $\tilde{\mu}_{i,j} = \frac{1}{|S_{i,j}|}\sum_{t \in S_{i,j}} X_{i,t}$, where $S_{i,j}$ is the set of time steps at which arm $i$ was played in the first $j$ phases. It can be seen that $\mathbb{E}[\tilde{\mu}_{i,j}] = \mu_i$.
If arm $i$ was played in phase $j$, then the difference between the observed and true phase rewards is bounded, where $d$ is the delay parameter over which the rewards are distributed. This is because the missing and extra reward components can be paired up, and the maximum difference we can obtain within each pair is at most one.
After $j$ phases, suppose arm $i$ was played $jm$ times. Then we can bound
(4)
since in each phase an arm is pulled $m$ times. This gives the claimed bound on the estimation error. ∎
Plugging this into our bound for $\mathbb{P}(G_i^c)$ gives
We then choose the confidence parameter such that the following holds for all suboptimal arms $i$; accordingly, it is selected as
(5) 
Using this choice and the fact that the rewards are obtained from sub-Gaussian distributions, we bound further as follows
(6) 
The next step is to bound the probability of the remaining term in (3). Using (4) we get
Because of our choice in (5), we have
Now we show that $u_i$ can be chosen such that the following inequality holds
(7) 
We assume that arm $i$ is played in a whole number of phases; this yields the choice of $u_i$. Using this choice of $u_i$ and the sub-Gaussian assumption, we can bound
(8) 
When substituted into (2), we obtain
(9) 
Under the stated assumption and the choice of $u_i$ from (7), equation (9) leads to
(10) 
All that remains is the choice of the remaining constants. We choose them somewhat arbitrarily, such that the last term in (10) does not contribute a polynomial dependence. This leads to
(11) 
Theorem 1 (restated).
Proof.
From (11), we have that for each suboptimal arm $i$ we can bound $\mathbb{E}[T_i(p)]$.
Therefore, using the basic regret decomposition (1) again, we have the stated bound,
where the first inequality follows from the bound on $\mathbb{E}[T_i(p)]$ and the last from the choice of $m$. The remaining term can be upper bounded using the fact that each $\Delta_i \le 1$. Thus we get the claimed regret bound. ∎
Appendix B Regret Analysis for Algorithm 2
Lemma 2 (restated).
Proof.
Since the rewards are spread over $d$ time steps in an adversarial way, in the worst case the first $d$ rewards collected for arm $i$ in phase $j$ contain components from previously pulled arms. Similarly, the reward components of the last $d$ pulls of arm $i$ in the phase seep into the pulls of the next arm. Defining $s_{i,j}$ and $e_{i,j}$ as the first and last time steps at which arm $i$ is played in phase $j$, we have
(12)
because we can pair up some of the missing and extra reward components, and in each pair the difference is at most one. Then, using (12) and the total number of pulls of arm $i$ up to phase $j$, we get
(13)
which gives the claimed bound. ∎
Lemma 3 (restated).
Proof.
For any arm $i$ and phase $j$, the deviation of $\hat{\mu}_{i,j}$ from $\mu_i$ can be bounded, where the first inequality is from the triangle inequality together with Lemma 2, and the last from Hoeffding's inequality, since the rewards are independent samples from $\nu_i$, the reward distribution of arm $i$. In particular, choosing $n_j$ large enough ensures that the deviation exceeds the confidence width only with the required small probability. ∎
Theorem 2 (restated).