1 Introduction

The multi-armed bandit (MAB) problem is a well-studied problem in machine learning theory, capturing the exploration-exploitation trade-off in online decision making. MAB has applications in domains such as e-commerce, computational advertising, clinical trials, and recommendation systems.
In most real world applications, assumptions of the original theoretical MAB model, such as immediate rewards or non-stochasticity of the rewards, do not hold. A more natural setting is one where the rewards of pulling bandit arms are delayed, since the effects of actions are not always observed immediately. Pike-Burke et al. (2018) first explored this setting, assuming stochastic rewards for pulling an arm that are obtained at some specific time step in the future. This setting is called delayed, aggregated, anonymous feedback (DAAF). The complexity of this problem stems from the anonymity of the feedback: the learner cannot distinguish which of the previous time steps' arm pulls produced the reward observed at a particular time.
Cesa-Bianchi et al. (2018) extended this work by relaxing the temporal specificity of observing the reward at one specific time in the future: the reward for pulling an arm may now be spread adversarially over multiple future time steps. However, they added an assumption of non-stochasticity of the rewards from each arm, so that the same total reward is observed each time the same arm is pulled. This scenario of non-stochastic composite anonymous feedback (CAF) applies to several applications, but it still does not cover the entire spectrum of applications.
Consider the setting of a clinical trial where the benefits of different medicines on improving patient health are observed. CAF offers a more natural extension to this scenario than DAAF, since the benefits of a medicine can be spread over multiple steps after taking it rather than arriving all at once at a single future time step. However, the effects of a medicine may differ across patients, so assuming the same total health improvement each time a specific medicine is used is not very realistic. Inspired by this real world setting, we suggest that a more general bandit setting is CAF with the non-stochasticity assumption dropped. We study such a MAB setting with stochastic delayed composite anonymous feedback (SDCAF).
In our model, a player interacts with an environment of K actions (or arms) in a sequential fashion. At each time step the player selects an action, which leads to a reward generated at random from an underlying reward distribution and spread over a fixed number of time steps after pulling the arm. More precisely, we assume that the reward for choosing an action at a time step is adversarially spread over at most d consecutive time steps in the future. At the end of each round, the player observes only the sum of all the rewards that arrive in that round. Crucially, the player does not know which of the past plays have contributed to this aggregated reward. Extending algorithms from the theoretical model of SDCAF to practical applications involves obtaining guarantees on the rewards obtained from them. The goal is to maximize the cumulative reward from plays of the bandit, or equivalently to minimize the regret (the total difference between the reward of the optimal action and the actions taken).
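To make the feedback model concrete, the following toy simulator sketches an SDCAF environment. The class name, the Bernoulli rewards, and the uniform split of each reward over the next d steps are illustrative choices of ours; the model itself allows any [0, 1]-supported distribution and any adversarial split.

```python
import random

class SDCAFEnvironment:
    """Toy simulator of stochastic delayed composite anonymous feedback.

    Each arm i has a Bernoulli reward with mean mu[i] (an illustrative
    choice; the model only requires a [0, 1]-supported distribution).
    The realized reward of a pull is split uniformly over the next d
    time steps; the player observes only the per-step sums.
    """

    def __init__(self, mu, d, seed=0):
        self.mu = mu          # arm means
        self.d = d            # maximum delay spread
        self.rng = random.Random(seed)
        self.t = 0
        self.pending = {}     # time step -> total reward arriving then

    def pull(self, arm):
        reward = 1.0 if self.rng.random() < self.mu[arm] else 0.0
        # Spread the reward over the next d steps (here: uniformly;
        # the model allows any adversarial split over d steps).
        for offset in range(self.d):
            self.pending[self.t + offset] = (
                self.pending.get(self.t + offset, 0.0) + reward / self.d
            )
        observed = self.pending.pop(self.t, 0.0)  # anonymous aggregate
        self.t += 1
        return observed
```

Each call to pull returns only the anonymous aggregate arriving at the current step, so the player cannot attribute it to particular past pulls.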
We present two algorithms for this setting, which involve running modified versions of the UCB algorithm of Auer et al. (2002) in phases, where the same arm is pulled multiple times within a particular phase. This is motivated by the aim of reducing the error in approximating the mean reward of an arm caused by extra and missing reward components from adjacent arm pulls. We prove sub-linear regret bounds for both algorithms.
1.1 Related Work
Online learning with delayed feedback has been studied in the non-bandit setting by Weinberger and Ordentlich (2006); Mesterharm (2005); Langford et al. (2009); Joulani et al. (2013); Quanrud and Khashabi (2015); Joulani et al. (2016); Garrabrant et al. (2016) and in the bandit setting by Neu et al. (2010); Joulani et al. (2013); Mandel et al. (2015); Cesa-Bianchi et al. (2016); Vernade et al. (2017); Pike-Burke et al. (2018). Dudík et al. (2011) consider stochastic contextual bandits with a constant delay and Desautels et al. (2014) consider Gaussian process bandits with a bounded stochastic delay. The general observation that delay causes an additive regret penalty in stochastic bandits and a multiplicative one in adversarial bandits is made in Joulani et al. (2013). The delayed composite loss function of our setting generalizes the composite loss function setting of Dekel et al. (2014).
2 Problem Definition
There are K actions or arms in the set [K] = {1, …, K}. Each action i is associated with a reward distribution ν_i supported in [0, 1], with mean μ_i. Let X_t(i) be a random variable which stands for the total reward obtained on pulling arm i at time t. X_t(i) is drawn from the distribution ν_i, and may be spread over a maximum of d time steps in an adversarial manner. X_t(i) is defined as the sum of d-many components X_t^s(i) for s ∈ {t, …, t + d − 1}, where X_t^s(i) denotes the reward obtained at time s on pulling arm i at time t.
Let i_t denote the action chosen by the player at the beginning of round t. If i_t = i, then the player obtains reward component X_t^t(i) at time t, X_t^{t+1}(i) at time t + 1, and so on until time t + d − 1. The reward that the player observes at time t is the combined reward obtained at time t, which is the sum of past reward contributions, where X_s^t(i) = 0 for all t < s and for t ≥ s + d.
3 Algorithms

We present two algorithms for this setting of SDCAF, in Algorithm 1 and Algorithm 2 respectively. For Algorithm 2 we only specify the additional inputs and initialization over Algorithm 1. We first provide the intuition behind the algorithms and then a formal regret analysis.
Algorithm 1 is a modified version of the standard UCB algorithm. It runs in phases in which the same arm is pulled multiple times, while an upper confidence bound on the reward of each arm is maintained. More specifically, each phase consists of two steps. In Step 1, the arm with the maximum upper confidence bound is selected. In Step 2, the selected arm is pulled m times repeatedly. We track all time steps where arm i is played up to phase j in a set S_{i,j}. The rewards obtained are used to update the running estimate of the arm mean. The intuition behind running the algorithm in phases is to gather sufficient rewards from a single arm so as to have a good estimate of its mean reward. This helps us bound the error in our reward estimate due to extra rewards from the previous phase and missing rewards which seep into the next phase due to delay. In every phase of the algorithm, the selected arm is pulled a fixed number of times m. From our regret analysis, an appropriate setting of m achieves sub-linear regret.
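A minimal sketch of this phased UCB idea is given below, assuming an environment exposed as a pull(arm) function that returns the anonymous aggregate observed at the current step. The function name, the confidence bonus constant, and the tie-breaking are our illustrative choices, not the exact pseudocode of Algorithm 1.

```python
import math

def phased_ucb(pull, K, m, num_phases):
    """Phased UCB sketch: in each phase, play the arm with the highest
    upper confidence bound m times in a row, then update its estimate
    with the aggregated observations from those steps."""
    totals = [0.0] * K   # sum of observed rewards attributed to each arm
    counts = [0] * K     # number of pulls of each arm
    for phase in range(num_phases):
        # Step 1: pick the arm with the largest UCB index.
        def ucb(i):
            if counts[i] == 0:
                return float("inf")   # force initial exploration
            bonus = math.sqrt(2.0 * math.log(1 + phase * m) / counts[i])
            return totals[i] / counts[i] + bonus
        arm = max(range(K), key=ucb)
        # Step 2: pull it m times; delayed components from this phase
        # mostly land inside the phase, shrinking the estimation error.
        for _ in range(m):
            totals[arm] += pull(arm)
            counts[arm] += 1
    # Return the empirically best arm.
    return max(range(K), key=lambda i: totals[i] / max(counts[i], 1))
```

For instance, with a deterministic pull function returning each arm's mean, the sketch converges on the arm with the highest empirical mean.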
Algorithm 2 is a modified version of the improved-UCB algorithm from Auer and Ortner (2010), run in phases, where a set of active arms is maintained and pruned based on the arm mean estimates. Each phase consists of two steps. In Step 1, each active arm i is played repeatedly until it has been pulled n_j times in total. We track all time steps where arm i was played in the first j phases in a set S_{i,j}. In Step 2, a new estimate of the arm mean reward is calculated as the average of the observations from the time steps in S_{i,j}. Arm i is eliminated if its estimate falls below that of the current best arm by more than the confidence width of phase j. The number of times each arm is pulled depends on n_j, which is chosen such that the confidence bounds on the estimation error of the arm means hold with a given probability. This algorithm is adapted from Pike-Burke et al. (2018), but we remove the bridge period from the original algorithm, as doing so does not impair the validity of the confidence bounds in our analysis.
We now provide the regret analysis for the two algorithms and specify the choice of the parameters m and n_j.
3.1 Regret Analysis for Algorithm 1
The regret analysis closely follows that of the UCB algorithm described in Lattimore and Szepesvári (2018). Without loss of generality we assume that the first arm is optimal. Thus we have μ_1 = max_i μ_i, and define Δ_i = μ_1 − μ_i. We assume that the algorithm runs for n phases. Let T_i(n) denote the number of times arm i is played up to phase n. We bound E[T_i(n)] for each sub-optimal arm i. For this we show that a good event E_i holds with high probability.
Here, E_i is the event that μ_1 is never underestimated by the upper confidence bound of the first arm, while at the same time the upper confidence bound for the mean of arm i, after u_i observations are taken from this arm, is below the payoff of the optimal arm. We claim that if E_i occurs, then T_i(n) ≤ u_i. Since we always have T_i(n) ≤ n, the following holds: E[T_i(n)] ≤ u_i + n · P(E_i^c).
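One standard way to write such a good event, in the style of Lattimore and Szepesvári (2018), is sketched below with our notation; the constant c and the count u_i are the parameters chosen later in the analysis, and the exact event in the original analysis may differ.

```latex
E_i \;=\;
\Big\{ \mu_1 < \min_{1 \le j \le n} \mathrm{UCB}_1(j) \Big\}
\;\cap\;
\Big\{ \hat{\mu}_{i,\,u_i} + c\,\sqrt{\tfrac{\log n}{u_i}} \,<\, \mu_1 \Big\}
```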
Next we bound the probability of occurrence of the complement event E_i^c.
Lemma 1: If μ̃_i is an unbiased estimator of μ_i built from the rewards generated up to phase j, then the error of the estimate μ̂_i computed from the observed rewards can be bounded as |μ̂_i − μ̃_i| ≤ d/m, where S_{i,j} is the set of time steps when arm i was played and μ̂_i is the average of the observations over S_{i,j}.
The proof of Lemma 1 follows from the fact that in each phase the missing and extra reward components can be paired up, and the maximum difference within each pair is at most one. We use Lemma 1 to bound P(E_i^c), which in turn gives an upper bound on the expected number of times a sub-optimal arm i is played.
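The resulting bound can be sketched as follows (notation ours: m is the per-phase pull count and j the number of phases in which arm i was played). At most d reward components leak out of each phase and at most d leak in, and each matched pair differs by at most one, so

```latex
\big|\hat{\mu}_i - \tilde{\mu}_i\big|
\;\le\; \frac{j \cdot d}{\,j \cdot m\,}
\;=\; \frac{d}{m}.
```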
For an appropriate choice of m, Algorithm 1 achieves a sub-linear regret bound.
3.2 Regret Analysis for Algorithm 2
The regret analysis for this algorithm is adapted from Appendix F of Pike-Burke et al. (2018). We first present a lemma bounding the difference between estimators of the arm mean reward.
If μ̃_i is an unbiased estimator for μ_i in phase j, then the difference between μ̃_i and the estimator μ̂_i used in Algorithm 2 can be bounded as |μ̂_i − μ̃_i| ≤ jd/n_j, where each arm is pulled n_j times up to phase j and S_{i,j} is the set of time steps when arm i was played.
The proof proceeds with a similar argument as for Lemma 1. Details can be found in Appendix B.
Choice of n_j: We use Algorithm 2 with a choice of n_j involving some large constants; the exact expression is given in Appendix B. We use Lemma 2 to bound the probability that a suboptimal arm i still remains in the set of active arms.
For the above choice of n_j, the estimates satisfy the following property: for every arm i and phase j, with high probability either the estimate of μ_i is within the phase-j confidence width of μ_i, or arm i has been eliminated by phase j.
For the stated choice of n_j, Algorithm 2 achieves a sub-linear regret bound.
The proof of Theorem 2 closely follows the analysis of improved UCB from Auer and Ortner (2010), using Lemma 3. Each sub-optimal arm i is eliminated in some phase j, which contributes a bounded regret term. We use our choice of n_j and sum over all sub-optimal arms to get the stated sub-linear regret bound. We refer the reader to Appendix B for the detailed regret analysis of Algorithm 2 and the proofs of Lemmas 2 and 3 and Theorem 2.
4 Conclusion and Future Work
We have studied an extension of the multi-armed bandit problem to stochastic bandits with delayed, composite, anonymous feedback. This setting is considerably difficult, since the rewards are stochastically generated and spread in an adversarial fashion over future time steps, which makes it hard to identify the optimal arm. We show that, surprisingly, it is possible to develop a simple phase-based extension of the standard UCB algorithm that performs comparably to algorithms for the simpler delayed feedback setting, where the assignment of rewards to arm plays is known. We suggest two possible directions for extending our work: first, the case where the delay parameter is not perfectly known, and second, extending the setting to contextual bandits.
References

- Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47 (2-3), pp. 235–256.
- UCB revisited: improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61 (1), pp. 55–65.
- Delay and cooperation in nonstochastic bandits. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pp. 605–622.
- Nonstochastic bandits with composite anonymous feedback. In Proceedings of the 31st Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 75, pp. 750–773.
- Online learning with composite loss functions. CoRR abs/1405.4471.
- Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research 15, pp. 4053–4103.
- Efficient optimal learning for contextual bandits. CoRR abs/1106.2369.
- Asymptotic convergence in online learning with unbounded delays. CoRR abs/1604.05280.
- Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 28, pp. 1453–1461.
- Delay-tolerant online convex optimization: unified analysis and adaptive-gradient algorithms. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 1744–1750.
- Slow learners are fast. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS'09, pp. 2331–2339.
- Bandit algorithms. Cambridge University Press, 2018.
- The queue method: handling delay, heuristics, prior data, and evaluation in bandits. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2849–2856.
- On-line learning with delayed label feedback. In ALT.
- Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems 23, pp. 1804–1812.
- Bandits with delayed, aggregated anonymous feedback. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 4105–4113.
- Online learning with adversarial delays. In Advances in Neural Information Processing Systems 28, pp. 1270–1278.
- Stochastic bandit models for delayed conversions. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017.
- On delayed prediction of individual sequences. IEEE Trans. Inf. Theor. 48 (7), pp. 1959–1976.
Appendix A Regret Analysis for Algorithm 1
Let μ_1, …, μ_K represent the means of the reward distributions ν_1, …, ν_K. Without loss of generality we assume that the first arm is optimal, so that μ_1 = max_i μ_i. We define Δ_i = μ_1 − μ_i. Algorithm 1 runs in phases of pulling the same arm for m time steps, and thus the regret over n phases can be written as R_n = Σ_{i=1}^{K} Δ_i E[T_i(n)], where T_i(n) denotes the number of times arm i was played in n phases. We bound E[T_i(n)] for each sub-optimal arm i. Let E_i be a good event defined for each arm i as follows, where c is a constant to be chosen later: E_i is the event that μ_1 is never underestimated by the upper confidence bound of the first arm, while at the same time the upper confidence bound for the mean of arm i, after u_i observations are taken from this arm, is below the payoff of the optimal arm. Two things are shown:
If E_i occurs, then T_i(n) ≤ u_i.
The complement event E_i^c occurs with low probability.
Since we always have T_i(n) ≤ n, the following holds: E[T_i(n)] ≤ u_i + n · P(E_i^c).
In the next step, we assume that E_i holds and show that T_i(n) ≤ u_i. Suppose T_i(n) > u_i. Then arm i was played more than u_i times over the n phases, so there must exist a phase in which arm i is selected after it has already been observed u_i times. But using the definition of E_i, the upper confidence bound of arm i after u_i observations is below μ_1, which is itself below the upper confidence bound of the first arm. Hence arm i could not have been selected in that phase, which is a contradiction. Therefore if E_i occurs, then T_i(n) ≤ u_i.
Now we bound P(E_i^c). The event E_i^c is as follows:
Using a union bound, the probability of the first term of E_i^c can be upper bounded as
Consider the following estimator μ̃_i for the mean of the rewards generated from arm i up to phase j:
where S_{i,j} is the set of time steps in which arm i was played. It can be seen that μ̃_i is an unbiased estimator of μ_i.
If arm i was played in phase j, then we have
where d is the delay parameter over which the rewards are distributed. This holds because the missing and extra reward components can be paired up, and the maximum difference we can obtain within each pair is at most one.
After j phases, suppose arm i was played T_i(j) times. Then we can bound the total error of the estimate by jd,
since in each phase an arm is pulled m times. This gives T_i(j) = jm and an overall error of at most jd/(jm) = d/m. ∎
Plugging this into our bound for P(E_i^c) gives
We then choose u_i such that the following holds for all sub-optimal arms:
Based on this condition, u_i is selected as
Using this choice of u_i and the fact that the rewards are obtained from distributions supported in [0, 1], which are sub-gaussian, we bound P(E_i^c) further as follows
Because of our choice of u_i in (5), we have
Now we show that the constant c can be chosen such that the following inequality holds
We assume that arm i is played in some number of phases; this gives us a valid choice of c. Using this choice of c and the sub-gaussian assumption, we can bound
When substituted in (2) we obtain
All that remains to be chosen is c. We choose c somewhat arbitrarily, and accordingly choose the remaining parameters such that the last term in (10) does not contribute a polynomial dependence. This leads to
From (11) we have that for each sub-optimal arm i we can bound
Therefore, using the basic regret decomposition again, we have
where the first inequality and the final bound follow from our parameter choices, and the remaining term can be upper bounded using the fact that each Δ_i is at most one. Thus we get the stated regret bound. ∎
Appendix B Regret Analysis for Algorithm 2
Since the rewards are spread over d time steps in an adversarial way, in the worst case the first d rewards collected for arm i in phase j would have components from previously played arms. Similarly, the reward components of the last d arm pulls would seep into the pulls of the next arm. Defining s_1 and s_2 as the first and last time steps of playing arm i in phase j, we have
because we can pair up some of the missing and extra reward components, and in each pair the difference is at most one. Then, since each arm has been pulled n_j times and using (12), we get
Define and recall that , where . ∎
For any ,
where the first inequality follows from the triangle inequality and the last from Hoeffding's inequality, since the samples are independent draws from ν_i, the reward distribution of arm i. In particular, a suitable choice of n_j guarantees that this bound holds with the required probability.
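For completeness, the form of Hoeffding's inequality used in the last step, for independent samples X_1, …, X_n in [0, 1] with mean μ, is:

```latex
\Pr\!\left( \Big| \frac{1}{n} \sum_{s=1}^{n} X_s - \mu \Big| \ge \epsilon \right)
\;\le\; 2 \exp\!\left( -2 n \epsilon^2 \right).
```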
This choice of n_j ensures that the required confidence bound holds. ∎