1 Introduction
The problem of the standard multi-armed bandit (MAB) is well studied in machine learning
Auer (2000); Vermorel and Mohri (2005) and is used to model online decision-making problems under uncertainty. Owing to their implicit exploration-vs-exploitation tradeoff, bandits are able to model clinical treatment, movie recommendation, retail management, job scheduling, etc., where the goal is to keep pulling the 'best item' in hindsight by sequentially querying one item at a time and subsequently observing a noisy reward feedback for the queried arm Even-Dar et al. (2006); Auer et al. (2002a); Auer (2002); Agrawal and Goyal (2012); Bubeck et al. (2012). However, in various real-world applications, the decision space (set of arms) often changes over time due to the unavailability of some items. For instance, in retail stores some items might go out of stock, on a certain day some websites could be down, some restaurants might be closed, etc. This setting is known as sleeping bandits in online learning Kanade et al. (2009); Neu and Valko (2014); Kanade and Steinke (2014); Kale et al. (2016), where at any round the set of available actions could vary either stochastically according to some unknown distribution Neu and Valko (2014); Cortes et al. (2019) or adversarially Kale et al. (2016); Kleinberg et al. (2010); Kanade and Steinke (2014). Besides, the rewards of the available actions could also vary stochastically or adversarially Kanade et al. (2009); Neu and Valko (2014). The problem is known to be NP-hard when both rewards and availabilities are adversarial Kleinberg et al. (2010); Kanade and Steinke (2014); Kale et al. (2016). In the case of adversarial rewards and stochastic availabilities, the achievable regret lower bound is known, scaling with the number of actions in the decision space. The well-studied EXP4 algorithm does achieve this optimal regret bound, although it is computationally inefficient Kleinberg et al. (2010); Kale et al. (2016).
However, the best known computationally efficient algorithm only guarantees a suboptimal regret bound,¹ (¹the Õ(·) notation hides logarithmic dependencies) which matches the lower bound neither in the number of actions nor in the horizon Neu and Valko (2014). In this paper we aim to give computationally efficient and optimal algorithms for the problem of sleeping bandits with adversarial rewards and stochastic availabilities. Our specific contributions are as follows.
Contributions
Organization.
We introduce the formal problem statement in Sec. 2. Sec. 3 presents our results when the availabilities of the items are independent of each other. A more general regret analysis is provided in Sec. 4. Our experimental evaluations are given in Sec. 5. We finally conclude in Sec. 6 with some directions for future work.
2 Problem Statement
Notation. The indicator random variable takes value 1 if its predicate is true and 0 otherwise. The Õ(·) notation is used to hide logarithmic dependencies.

2.1 Setup
Suppose the decision space (or set of actions) consists of distinct actions, and consider a sequential game over rounds. At each time step, the learner is presented with a set of available actions, from which the learner's task is to play one action and consequently suffer its loss, where the loss sequence is chosen obliviously, independent of the available actions at each time. We consider the following two types of availabilities:
Independent Availabilities. In this case we assume that the availability of each item is independent of the rest: at each round, each item is drawn into the available set independently with its own availability probability, where these probabilities are fixed over time, independent of each other, and unknown to the learner.

General Availabilities. In this case each available set is drawn i.i.d. from some unknown distribution over subsets, with no further assumption made on the properties of that distribution. We denote accordingly the probability of occurrence of each set.
2.2 Objective
We define a policy to be a mapping from a set of available actions/experts to an item.
Regret definition. The performance of the learner, measured with respect to the best policy in hindsight, is defined as:
(1) 
where the expectation is taken w.r.t. the availabilities and the randomness of the player’s strategy.
The regret lower bound for the above objective is known; however, to the best of our knowledge, no existing efficient algorithm is known to achieve this optimal rate, which is the primary objective of this work.
3 Proposed algorithm: Independent Availabilities
In this section we propose our first algorithm for the problem (Sec. 2), which is based on a variant of the EXP3 algorithm with a 'suitable' loss-estimation technique. Thm. 2 proves the optimality of its regret performance.
Algorithm description. Similar to the EXP3 algorithm, at every round we maintain a probability distribution over the arm set, and also the empirical availability of each item. Upon receiving the available set, the algorithm redistributes the probability mass onto the set of available items only and plays an item accordingly. Subsequently, the environment reveals the loss, and we update the distribution using exponential weights on the loss estimates for all arms:
(2) 
where the first quantity is a scale (learning-rate) parameter and the second (see definition (3)) is an estimate of the probability of playing the arm at the current time, under the joint uncertainty in availability (due to the unknown availability distribution) and the randomness of the EXP3 algorithm.
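For concreteness, one multiplicative-weight step of update (2) might look as follows. This is a sketch under our own naming: `q_hat` stands for the joint play-probability estimate of (3), and `eta` for the scale parameter.

```python
import math

def exp3_sleeping_update(weights, played_arm, loss, q_hat, eta):
    """Exponential-weight step: only the played arm is penalized, using the
    importance-weighted loss estimate loss / q_hat, then renormalize so the
    weights remain a probability distribution over all arms."""
    new_w = dict(weights)
    new_w[played_arm] *= math.exp(-eta * loss / q_hat)
    z = sum(new_w.values())
    return {arm: w / z for arm, w in new_w.items()}
```

Starting from uniform weights over two arms and playing arm 0, the played arm's weight shrinks while the distribution stays normalized.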
New insight compared to existing algorithms. It is crucial to note that one of our main contributions lies in the loss-estimation technique in (2). The standard loss estimates used by EXP3 (see Auer et al. (2002b)) are importance-weighted by the probability of playing the arm. Yet, because of the unavailable actions, the latter is biased here. The solution proposed by Neu and Valko (2014) (see Sec. 4.3) consists in using unbiased loss estimates built from two separate estimates: one for the availability probability and one for the weight. The suboptimality of their regret bound results from this separate estimation, which leads to a high variance in the analysis whenever the estimated quantities are small. We circumvent this problem by estimating them jointly as
(3) 
where the weight of each set is its empirical probability of availability, and, for every available item,
(4) 
is the redistributed mass of the current distribution on the support of the available set. As shown in Lem. 1, this is a good estimate of the conditional probability of playing the action at the current round. It turns out that this joint estimate is much more stable than the separate ones, and therefore yields better variance control in the regret analysis. This improvement finally leads to the optimal regret guarantee (Thm. 2). The complete algorithm is given in Alg. 1.
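To make the joint estimate concrete, the following sketch computes (3)-(4) exactly by enumerating subsets under product-form (independent) availability estimates. It is exponential in the number of arms, which is precisely the cost addressed in Sec. 3.1. The names and the restriction to nonempty sets are our own assumptions.

```python
from itertools import combinations

def redistribute(p, avail):
    """Eq. (4)-style step: restrict distribution p to the available set
    and renormalize it there."""
    z = sum(p[i] for i in avail)
    return {i: p[i] / z for i in avail}

def exact_play_prob(p, a_hat, arm):
    """Eq. (3)-style joint estimate: average the redistributed mass of
    `arm` over all nonempty subsets, weighted by the product-form
    availability estimates a_hat. O(2^K) work, which motivates the
    sampling approximation of Sec. 3.1."""
    K = len(p)
    total = 0.0
    for r in range(1, K + 1):
        for S in combinations(range(K), r):
            w = 1.0
            for i in range(K):
                w *= a_hat[i] if i in S else 1.0 - a_hat[i]
            if arm in S:
                total += w * redistribute(p, S)[arm]
    return total
```

For two arms with uniform weights, if arm 0 is always available and arm 1 is available half the time, arm 0 is played with probability 0.75 under this estimate.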
The first crucial result we derive towards proving Thm. 2 is the following concentration guarantee:
Lemma 1 (Concentration of ).
Using the result of Lem. 1, the following theorem analyses the regret guarantee of SleepingEXP3 (Alg. 1).
Theorem 2 (SleepingEXP3: Regret Analysis).
Let . The sleeping regret incurred by SleepingEXP3 (Alg. 1) can be bounded as:
for the parameter choices , , and
Proof.
(sketch) Our proof builds on the standard regret guarantee of the EXP3 algorithm for the classical problem of multi-armed bandits with adversarial losses Auer et al. (2002b); Auer (2002). Precisely, consider any fixed set of arms, and suppose we run the EXP3 algorithm on this set, over any non-negative sequence of losses on its items, with weight updates as per the EXP3 algorithm with a given learning rate. Then, from the standard regret analysis of the EXP3 algorithm Cesa-Bianchi and Lugosi (2006), we get that for any competitor:
Consider any strategy. Then, applying the above regret bound to this choice and taking the expectation over the availabilities and over the possible randomness of the estimated losses, we get
(6) 
Now, towards proving the actual regret bound of SleepingEXP3 (recall the definition from Eqn. (1)), we first need to establish the following three main sub-results, which relate the different expectations in Inequality (6) to quantities related to the actual regret (in Eqn. (1)).
Lemma 3.
Lemma 4.
With the above claims in place, we now proceed to prove the main theorem. Let us denote the best policy in hindsight. Now, recalling the actual regret definition of our proposed algorithm from Eqn. (1), and combining the claims from Lem. 3 and 4, we first get:
Then, we can further upper-bound the last term on the right-hand side using Inequality (6) and Lem. 5, which yields
(7) 
where in the last inequality we used the assumed bounds on the parameters. Otherwise, we can always make the alternative parameter choice in the algorithm and Lem. 1 would still be satisfied.
The proof is concluded by substituting the parameter values and by bounding the two sums as follows:
and, using the stated parameter choices, we have
Then we can further upper-bound both terms; substituting these bounds into (7), we get
Optimizing over the learning rate and upper-bounding the remaining terms finally concludes the proof. ∎
The above regret bound is optimal in the horizon, unlike any previous work, which could only achieve a suboptimal guarantee Neu and Valko (2014) at best. Thus our regret guarantee is only suboptimal in terms of the number of arms, as the lower bound for this problem is known Kleinberg et al. (2010); Kanade et al. (2009). However, it should be noted that in our experiments (see Figure 4), the dependence of our regret on the number of arms behaves similarly to the other algorithms, although their theoretical guarantees suggest a better dependence on the number of arms. The suboptimality could thus be an artifact of our analysis but, despite our efforts, we have not been able to improve it. We think this may come from our proof of Lem. 1, in which we see two coarse inequalities that may cost us this dependence on the number of arms. First, the proof uses a worst-case upper bound, while on average the quantity involved is much smaller; yet dependence issues force us to use this worst-case bound. Second, the proof uses uniform bounds over the arms when the estimation errors could offset each other.
Note also that the regret bound in the theorem is worst-case. An interesting direction for future work would be to study whether it is possible to derive an instance-dependent bound. Typically, the number of arms could be replaced by the expected number of active experts. A first step in this direction would be to start from Inequality (15) in the proof of Lem. 1 and try to keep the dependence on the availability distribution along the proof.
Finally, note that the algorithm only requires knowing the horizon beforehand in order to tune its hyperparameter. However, this assumption can be removed by using standard calibration techniques such as the doubling trick (see Cesa-Bianchi and Lugosi (2006)).
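The doubling trick mentioned above can be sketched as follows. The learning-rate formula here is a generic EXP3-style tuning we assume for illustration, not the exact constants of Thm. 2.

```python
import math

def doubling_epochs(total_rounds, K):
    """Run the algorithm in restarts on epochs of length 1, 2, 4, ...,
    tuning the learning rate for each epoch length in place of the
    unknown horizon; the overall regret only inflates by a constant."""
    epochs = []
    t, m = 0, 0
    while t < total_rounds:
        length = 2 ** m
        eta = math.sqrt(math.log(K) / (length * K))  # per-epoch tuning
        epochs.append((length, eta))
        t += length
        m += 1
    return epochs
```

For example, over 10 rounds the restarts run on epochs of lengths 1, 2, 4, and 8.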
3.1 Efficient SleepingEXP3: Improving Computational Complexity
Thm. 2 shows the optimality of SleepingEXP3 (Alg. 1), but one limitation lies in computing the probability estimates, which requires a heavy computational cost per round. In this section we show how to get around this problem simply by approximating the estimate with an empirical average
(8) 
where the sets are independent draws from the estimated availability distribution (recall the notation from Sec. 3). The above trick proves useful with the crucial observation that these draws are independent of each other (given the past) and that each resulting term is an unbiased estimate of the target quantity. By classical concentration inequalities, this precisely leads to fast concentration of the empirical average to the exact estimate, which in turn concentrates to the true probability of play (by Lem. 1). Combining these results, one can thus obtain the desired concentration, as shown in Lem. 6.

Lemma 6 (Concentration of ).
Remark 1.
Using the result of Lem. 6, we now derive the following theorem analyzing the regret guarantee of the computationally efficient version of SleepingEXP3.
Theorem 7 (SleepingEXP3 (Computationally efficient version): Regret Analysis).
Let . The sleeping regret incurred by the efficient approximation of SleepingEXP3 (Alg. 1) can be bounded as:
for the parameter choices , and .
Furthermore, the perround time and space complexities of the algorithm are and respectively.
Proof.
(sketch) The regret bound can be proved using similar steps as that described for Thm. 2, except now replacing the concentration result of Lem. 6 in place of Lem. 1.
Computational complexity: At any round, the algorithm requires only a small cost to update its maintained quantities. Resampling the subsets and computing the empirical estimate requires another comparable cost, resulting in the claimed computational complexity.
Space complexity: We only need to keep track of the maintained distribution and the empirical availabilities, making the total storage complexity small (noting these can be computed sequentially). ∎
4 Proposed algorithm: General Availabilities
Setting. In this section we assume general subset availabilities (see Sec. 2).
4.1 Proposed Algorithm: SleepingEXP3G
Main idea. By and large, we use the same EXP3-based algorithm as proposed for the case of independent availabilities, the only difference being the use of a different empirical estimate
(9) 
In hindsight, the above estimate is equal to the corresponding expectation taken under the empirical probabilities of the observed sets at each time. The rest of the algorithm proceeds the same as Alg. 1; the complete description is given in Alg. 2.
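Under general availabilities, estimate (9) can be sketched by averaging over the empirical distribution of the sets observed so far (function and variable names are our own):

```python
from collections import Counter

def general_play_prob(p, observed_sets, arm):
    """Average the redistributed mass of `arm` over the empirical
    distribution of past available sets, with no independence
    assumption on the availabilities."""
    counts = Counter(observed_sets)
    n = len(observed_sets)
    total = 0.0
    for S, c in counts.items():
        if arm in S:
            total += (c / n) * p[arm] / sum(p[i] for i in S)
    return total
```

For instance, if half the observed sets contain only arm 0 and the other half contain both arms, arm 0 receives mass 0.75 under uniform weights.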
4.2 Regret Analysis
We first analyze the concentration of the empirical probability of playing each item at any round, and the result goes as follows:
Lemma 8 (Concentration of ).
Theorem 9 (SleepingEXP3G: Regret Analysis).
Let . Suppose we set , , and . Then, the regret incurred by SleepingEXP3G (Alg. 2) can be bounded as:
Furthermore, the perround space and time complexities of the algorithm are .
Proof.
(sketch) The proof proceeds almost identically to that of Thm. 2, except that the corresponding versions of the main lemmas, i.e. Lem. 3, 4, and 5, are now satisfied using the concentration result of Lem. 8 instead of Lem. 1.
Similar to the proof of Thm. 2 and following the same notations, we first combine claims from Lem. 3, 4 to get:
where Inequalities (a) and (b) follow from (6) and Lem. 5 respectively. The last inequality holds by the assumed parameter bounds. To conclude the proof, it now only remains to compute the sums and to choose the parameters. Using the stated choices, we further have
which entails:
Finally substituting and above bounds in the regret upperbound yields the desired result.
Complexity analysis. The only difference with Alg. 1 comes from the computation of the new estimate. Following an argument similar to the one given for the computational complexity in Thm. 7, this can also be performed at a modest computational cost. Yet, the algorithm now specifically needs to keep in memory the empirical distribution of the observed sets, and thus a larger space complexity is required. ∎
Our regret bound in Thm. 9 has the optimal dependency on the horizon: to the best of our knowledge, SleepingEXP3G (Alg. 2) is the first computationally efficient algorithm to achieve such a guarantee for this version of the sleeping-bandits problem. Of course, the EXP4 algorithm is known to attain the optimal regret bound; however, it is computationally infeasible Kleinberg et al. (2010) due to the overhead of maintaining a combinatorially large policy class.
Yet, on the downside, it is worth pointing out that the regret bound of Thm. 9 only provides a sublinear regret in a restricted regime, in which algorithms such as EXP4 can also be implemented efficiently. However, we still believe our algorithm to be an interesting contribution, since it completes another side of the computation-performance tradeoff: it is possible to move the exponential dependence on the number of experts from the computational complexity to the regret bound.
Another argument in favor of this algorithm is that it provides an efficient alternative to EXP4, with regret guarantees in the favorable regime. In the other regime, though we could not prove any meaningful regret guarantee, the algorithm of Sec. 4.1 performs very well in practice, as shown by our experiments. We believe the exponential constant in the regret to be an artifact of our analysis. However, removing it seems to be highly challenging due to dependencies between the estimates across rounds. An analysis of the concentration of the estimate (defined in (9)) without exponential dependence on the number of arms proved to be particularly complicated. We leave this question for future research.
5 Experiments
In this section we present the empirical evaluation of our proposed algorithms (Sec. 3 and 4), comparing their performances with the two existing sleeping-bandit algorithms that apply to our problem setting, i.e. adversarial losses and stochastic availabilities. We thus report the comparative performances of the following algorithms:
Performance Measures.
In all cases, we report the cumulative regret of the algorithms over the time steps, each curve averaged over multiple independent runs.
The following subsections analyze our experimental evaluations for both independent and general (non-independent) availabilities.
5.1 Independent Availabilities
In this case the item availabilities are assumed to be independent at each round (description in Sec. 2).
Environments.
We generate the probabilities of item availabilities independently and uniformly at random from a fixed interval. Recall that we assumed the loss sequence to be oblivious to the availabilities, towards which we use the following loss-generation techniques:

Switching loss or SL. We generate the loss sequence such that the best-performing expert changes after every epoch of a fixed length.

Markov loss or ML(p). Similar to the setting used by Neu and Valko (2014), the losses of each arm are constructed as a random walk with Gaussian increments of a given standard deviation, initialized uniformly, with losses outside the unit interval truncated. The explicit values of the parameters are specified in the corresponding figures.

Remarks.
Fig. 1 clearly shows that our proposed algorithms SleepingEXP3 and SleepingEXP3G outperform the other two, owing to their order-wise optimal regret performance (see Thm. 2 and 9). In particular, BanditSFPL performs the worst due to its initial exploration rounds and its uniform exploration phases thereafter. SleepingCat achieves a much more competitive regret compared to BanditSFPL, yet still lags behind due to its weaker regret guarantee (as discussed above).
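The two loss generators used in these experiments (SL and ML above) can be sketched as follows; the specific loss levels 0.1/0.9 for the switching loss are our own illustrative choice.

```python
import random

def switching_losses(K, T, epoch_len):
    """SL: the best (low-loss) arm rotates every epoch_len rounds,
    so no single arm stays optimal for long."""
    return [[0.1 if i == (t // epoch_len) % K else 0.9 for i in range(K)]
            for t in range(T)]

def markov_losses(K, T, sigma, rng):
    """ML: per-arm random walks with Gaussian increments of std sigma,
    initialized uniformly on [0, 1] and truncated to stay in [0, 1]."""
    cur = [rng.random() for _ in range(K)]
    seq = []
    for _ in range(T):
        cur = [min(1.0, max(0.0, x + rng.gauss(0.0, sigma))) for x in cur]
        seq.append(list(cur))
    return seq
```

Both generators are oblivious: the loss sequences are fixed before the availabilities are drawn, matching the setup of Sec. 2.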
5.2 Regret vs Varying Availabilities.
We next conduct a set of experiments to compare the regret performances of the algorithms under varying availability probabilities: for this we assign the same availability probability to every item, vary it over a range, and plot the final cumulative regret of each algorithm.
Remarks.
From Fig. 2, we again note that our proposed algorithms outperform the other two by a large margin for almost every availability probability. BSFPL performs worst: its regret decreases steadily with increasing availability probability, owing to the explicit exploration rounds in its initial phase, and even thereafter it keeps suffering the loss of the uniform policy scaled by the exploration probability.
5.3 Correlated (General) Availabilities
We now assess the performances when the availabilities of the items are dependent (description in Sec. 2).
Environments.
To enforce dependencies among item availabilities, we generate each available set by drawing a random sample from a multivariate Gaussian whose covariance is some random positive-definite matrix, e.g. a block-diagonal matrix with strong correlations among certain groups of items. To generate the loss sequences, we use techniques similar to those described in Sec. 5.1.
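A hedged sketch of this correlated-availability generator follows; thresholding at zero and the Cholesky-factor parameterization of the covariance are our assumptions about the construction.

```python
import random

def correlated_set(mean, chol, rng):
    """Draw z ~ N(mean, L L^T) with L = chol lower-triangular, and make
    item i available when z_i > 0; block structure in L induces strong
    within-group correlation of availabilities."""
    g = [rng.gauss(0.0, 1.0) for _ in mean]
    z = [mean[i] + sum(chol[i][j] * g[j] for j in range(i + 1))
         for i in range(len(mean))]
    return {i for i, v in enumerate(z) if v > 0.0}
```

With a strongly positive mean for one item and a strongly negative mean for another, the first is essentially always available and the second essentially never.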
Remarks.
From Fig. 3 one can again verify the superior performance of our algorithms over SleepingCat and BanditSFPL; however, the effect is only visible for large horizons, as for smaller time steps the lower-order terms dominate the regret, but as the horizon grows our optimal rate outperforms the suboptimal rates of SleepingCat and BanditSFPL respectively.
5.4 Regret vs Varying Itemsize .
Finally, we also conduct a set of experiments varying the item-set size over a wide range. We report the final cumulative regret of all algorithms versus the number of items for different switching-loss sequences, for both independent and general availabilities, as specified in Fig. 4.
Remark.
Fig. 4 shows that the regret of each algorithm increases with the number of items, as expected. As before, the two baselines perform suboptimally in comparison to our algorithms; however, the interesting thing to note is the relative performance of SleepingEXP3 and SleepingEXP3G: as per Thm. 2 and 9, SleepingEXP3 should outperform SleepingEXP3G as the number of items increases, yet the effect does not seem to be so drastic experimentally, possibly revealing scope for improving Thm. 9 in terms of a better dependency on the number of items.
6 Conclusion and Future Work
We have presented a new approach that brings an improved rate for the setting of sleeping bandits with adversarial losses and stochastic availabilities, including both minimax and instance-dependent guarantees. While our bounds guarantee a sublinear regret, there are several open questions before the studied setting can be considered closed. First, for the case of independent availabilities, our regret guarantee leaves open whether the optimal dependence on the number of arms is achievable, as in the standard non-sleeping setting. Second, while we provided a computationally efficient SleepingEXP3G for the case of general availabilities and provided instance-dependent regret guarantees for it, the worst-case regret guarantee still carries an exponential factor. Therefore, it is still unknown whether, for general availabilities, we can get an algorithm that is both computationally efficient and has an optimal regret guarantee in the worst case. We would like to point out that the new techniques could potentially be used to provide new algorithms and guarantees in settings with similar challenges as in sleeping bandits, such as rotting or dying bandits. Finally, having algorithms for sleeping bandits with regret guarantees opens a way to deal with sleeping constraints in more challenging structured bandits with a large or infinite number of arms, with the regret guarantee depending not on the number of arms but rather on some effective dimension of the arm space.
Acknowledgements
The research presented was supported by the French National Research Agency project BOLD (ANR-19-CE23-0026-04) and by the European CHIST-ERA project DELTA. We also wish to thank Antoine Chambaz and Marie-Hélène Gbaguidi.
References

Agrawal and Goyal [2012]
Shipra Agrawal and Navin Goyal.
Analysis of Thompson sampling for the multi-armed bandit problem.
In Conference on Learning Theory, pages 39–1, 2012.
 Auer [2000] Peter Auer. Using upper confidence bounds for online learning. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 270–279. IEEE, 2000.
 Auer [2002] Peter Auer. Using confidence bounds for exploitation-exploration tradeoffs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 Auer et al. [2002a] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.
 Auer et al. [2002b] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
 Bubeck et al. [2012] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 Cesa-Bianchi and Lugosi [2006] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 Cortes et al. [2019] Corinna Cortes, Giulia Desalvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with sleeping experts and feedback graphs. In International Conference on Machine Learning, pages 1370–1378, 2019.

Even-Dar et al. [2006]
Eyal Even-Dar, Shie Mannor, and Yishay Mansour.
Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.
Journal of Machine Learning Research, 7(Jun):1079–1105, 2006. 
Kale et al. [2016]
Satyen Kale, Chansoo Lee, and Dávid Pál.
Hardness of online sleeping combinatorial optimization problems.
In Advances in Neural Information Processing Systems, pages 2181–2189, 2016.
 Kanade and Steinke [2014] Varun Kanade and Thomas Steinke. Learning hurdles for sleeping experts. ACM Transactions on Computation Theory (TOCT), 6(3):11, 2014.
 Kanade et al. [2009] Varun Kanade, H Brendan McMahan, and Brent Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. 2009.
 Kleinberg et al. [2010] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.
 Neu and Valko [2014] Gergely Neu and Michal Valko. Online combinatorial optimization with stochastic decision sets and adversarial losses. In Advances in Neural Information Processing Systems, pages 2780–2788, 2014.
 Vermorel and Mohri [2005] Joannes Vermorel and Mehryar Mohri. Multi-armed bandit algorithms and empirical evaluation. In European Conference on Machine Learning, pages 437–448. Springer, 2005.
Supplementary: Improved Sleeping Bandits with Stochastic Action Sets
and Adversarial Rewards
Appendix A Appendix for Sec. 3
A.1 Proof of Lem. 1
See 1
Proof.
Let and . We start by noting the concentration of to for all . By Bernstein’s inequality together with a union bound over : with probability at least , for all
(10) 
Then, using the definitions of the two estimates, we get