Online learning is a branch of machine learning concerned with dynamically optimizing utility (or loss) over time in the face of uncertainty, and it provides valuable principles for reasoning about acting under uncertainty. The study of online learning has developed along two concrete lines insofar as modeling the uncertain environment is concerned. On one hand, there is a rich body of work on learning in stochastic environments from an average-case point of view, such as i.i.d. multi-armed bandits (see, for example, the survey of Bubeck et al. (2012)), online learning in Markov decision processes (Jaksch et al., 2010; Azar et al., 2017), stochastic partial monitoring (Bartók et al., 2011), etc., which often yields performance guarantees that are strong but can depend closely on the stochastic models at hand. On the other hand, much work has been devoted to studying non-stochastic (or arbitrary or adversarial) models of environments from a worst-case point of view, with prediction with experts, bandits and partial monitoring problems to name a few (Cesa-Bianchi and Lugosi, 2006), which naturally yields rather pessimistic guarantees.
Recent efforts have focused on bridging this spectrum of modeling structure by considering online learning problems arising from non-stochastic environments whose loss function sequences exhibit adequate temporal regularity. These include the derivation of first-order regret bounds, which adapt to loss sequences with low loss of the best action (Allenberg et al., 2006), and second-order bounds, which adapt to loss sequences with low variation, in prediction with experts (Rakhlin and Sridharan, 2012; Steinhardt and Liang, 2014) and ‘benign’ multi-armed bandits (Hazan and Kale, 2011; Bubeck et al., 2019, 2017; Wei and Luo, 2018).
In this regard, this paper is an attempt to extend our understanding of adaptivity to low variation in several standard online learning problems where information comes at a cost, namely label efficient prediction (Cesa-Bianchi et al., 2005), label efficient bandits and classes of partial monitoring problems (Bartók et al., 2014). In the process, we uncover new ways of using existing online learning techniques within the Online Mirror Descent (OMD) family, and partially make progress towards a program of studying the impact of ‘easy’ (i.e., slowly-varying) environments in information-constrained online learning and partial monitoring problems. Our specific contributions are:
For the label efficient prediction game with expert advice, we give a learning algorithm with a regret bound of where is the quadratic variation of the best expert, is the time horizon of the game, is the number of experts and is the bound on label queries; the bound holds for all regimes except when . We follow this up with an algorithm with an unconditional regret guarantee of that holds for any label query budget and total quadratic variation . Our algorithms are based on the optimistic OMD framework, but with new combinations of the negative entropy and log-barrier regularization that are best suited to the label efficient game’s information structure.
We generalize the results to label efficient bandits where one receives bandit (i.e., for only the chosen expert) feedback at only up to chosen time instants, and obtain regret. We also show that our upper bounds on regret for label efficient prediction and label efficient bandits are tight in their dependence on and by demonstrating variation-dependent fundamental lower bounds on regret.
We show that adapting to low variation is also possible in the class of hard partial monitoring games as per the taxonomy of partial monitoring problems by Bartók et al. (2014), where we show an algorithm that achieves regret. To the best of our knowledge, this is the first algorithm exhibiting instance-dependent bounds for partial monitoring.
Problem Setup and Notation
A label efficient prediction game proceeds for rounds with arms or ‘experts’. In each round (time instant) , the learner selects an arm . Simultaneously, the adversary chooses a loss vector where is the loss of arm at time . At each round, the learner can additionally choose to observe the full loss vector , provided the number of times it has done so in the past has not exceeded a given positive integer that represents an information budget or constraint. We work in the oblivious adversarial setting where does not depend on the previous actions of the learner ; this is akin to the adversary fixing the (worst-possible) sequence of loss vectors in advance. The learner’s goal is to minimize its expected regret defined as
where the expectation is taken with respect to the learner’s randomness. Given a convex function over , we denote by the Bregman divergence with respect to defined as . For any point , we define the local norm at with respect to as and the corresponding dual norm as . We denote by , the fraction of time we are allowed the full loss vector i.e. . The can be seen as a way to model the constraint on information defined by the problem. The quadratic variation for a loss vector sequence is defined by with . Additionally, the quadratic variation of the best arm(s) is where and .
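The quantities defined in this paragraph can be made concrete with a short sketch. It assumes the standard definitions: the Bregman divergence ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩, and, following Hazan and Kale (2011), the quadratic variation as the sum of squared Euclidean deviations of each loss vector from the empirical mean. The function names and the `(T, K)` array layout are illustrative choices, not notation from the paper.

```python
import numpy as np

def bregman_divergence(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y> (standard definition)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(psi(x) - psi(y) - np.dot(grad_psi(y), x - y))

def quadratic_variation(losses):
    """Assumed definition: sum_t ||l_t - mu_T||_2^2, with mu_T the mean loss vector."""
    losses = np.asarray(losses, dtype=float)  # shape (T, K): one row per round
    mu = losses.mean(axis=0)
    return float(((losses - mu) ** 2).sum())

def best_arm_quadratic_variation(losses):
    """Quadratic variation restricted to the arm with the smallest cumulative loss."""
    losses = np.asarray(losses, dtype=float)
    best = int(np.argmin(losses.sum(axis=0)))  # ties broken by lowest index
    column = losses[:, best]
    return float(((column - column.mean()) ** 2).sum())
```

For a loss sequence that never changes, both variation quantities are zero, which is the "easy" regime the paper targets.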
2 Key Ideas and Algorithms
The underlying framework behind our algorithms is that of Online Mirror Descent (OMD) (see, for example, Hazan (2016)). The vanilla update rule of (active) mirror descent can be written as . On the other hand, our updates are:
where , corresponds to optimistic estimates of the loss vectors (which we will also refer to as messages), and denotes a second order correction that we explicitly define later. (‘Optimistic’ is used to denote the fact that we would be best off if these estimates were exactly the upcoming loss. Indeed, if were , it would be equivalent to 1-step lookahead, known to yield low regret.) Throughout the paper, is used to denote an (unbiased) estimate of that the learner constructs at time . Optimistic OMD with second order corrections was first studied in Wei and Luo (2018), whereas its Follow-the-Regularized-Leader (FTRL) counterpart was introduced earlier by Steinhardt and Liang (2014). Both of these approaches build upon the general optimistic OMD framework of Rakhlin and Sridharan (2012) and Chiang et al. (2012). We define our updates with scaled losses and messages, where we reiterate that the scaling factor reflects the limitation on information. This scaling also impacts our second order corrections which are . It is worthwhile to note that this is explicitly different from the that one may expect in light of the analysis done in Wei and Luo (2018), or the one would anticipate when following Steinhardt and Liang (2014). One may argue that our update rules are equivalent to dividing throughout by , or put differently, to merging an into the step size, and this is indeed true. However, the point we would like to emphasize is that no matter how one defines the updates, the second order correction can be seen to incorporate the problem dependent parameter . This tuning of the second order correction based on is different from what one observes for the full information problem (Steinhardt and Liang, 2014) or for bandits (Wei and Luo, 2018). The second order corrections represent a further penalty on arms which are deviating from their respective messages, and these corrections are what enable us to furnish best arm dependent bounds. As usual, the arm we play is still sampled from the distribution given by equation (1).
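To illustrate the two-step structure described above, here is a minimal sketch of one round of optimistic OMD with an optional second order correction. It is a simplification under stated assumptions: it uses the plain negative entropy regularizer on the simplex (so both steps have closed-form multiplicative updates), it omits the scaling by the information parameter, and the correction is simply supplied by the caller. It is not the paper's Algorithm 1; `optimistic_omd_step` and its arguments are hypothetical names.

```python
import numpy as np

def optimistic_omd_step(x_prime, message, loss_estimate, eta, correction=None):
    """One round of optimistic OMD with negative entropy (illustrative only).

    x_prime       : auxiliary iterate carried between rounds (a distribution)
    message       : optimistic guess of the upcoming loss vector
    loss_estimate : (estimated) loss vector revealed after playing
    correction    : optional second order correction added to the loss
    """
    x_prime = np.asarray(x_prime, dtype=float)
    message = np.asarray(message, dtype=float)
    loss_estimate = np.asarray(loss_estimate, dtype=float)
    if correction is None:
        correction = np.zeros_like(x_prime)
    else:
        correction = np.asarray(correction, dtype=float)

    # Step 1: the distribution to play, obtained by moving against the message.
    x_play = x_prime * np.exp(-eta * message)
    x_play /= x_play.sum()

    # Step 2: the next auxiliary iterate, obtained by moving against the
    # loss estimate plus the correction.
    x_next = x_prime * np.exp(-eta * (loss_estimate + correction))
    x_next /= x_next.sum()
    return x_play, x_next
```

When the message equals the realized loss and the correction is zero, the played distribution coincides with the next auxiliary iterate, which is the 1-step lookahead behaviour that motivates the term "optimistic".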
|Reference||Feedback||Negentropy:Log-barrier Regularizer Ratio Used|
|Bubeck et al. (2017)||Bandit|
|Wei and Luo (2018)||Bandit|
|Bubeck et al. (2019)||Bandit|
|Steinhardt and Liang (2014)||Full Information|
|This work||Label Efficient– Full Information|
|This work||Label Efficient– Bandit Feedback|
Challenges & Our Choice of Regularization
We briefly discuss the challenges posed by label efficient prediction and how our choice of regularizer addresses these. When shifting away from the classical prediction with expert advice problem to any limited feedback (i.e., over experts or arms) information structure, one usually works with importance-weighted estimates of the loss vectors constructed using the observed (limited) feedback (called inverse propensity weighting estimation). This is indeed the case with label efficient prediction; however, the probabilities in the denominator remain fixed at , unlike in bandits where the in the denominator can be arbitrarily small.
Consequently, one may be led to believe that the standard negative entropic regularizer, as is typically used for full information (Steinhardt and Liang, 2014), will suffice for the more general but related label efficient prediction. However, maintaining the inequality which is standard in analyses similar to Exp3 imposes a strict bound of . Since the low quadratic variation, on the other hand, would encourage one to set an aggressive learning rate , this makes the applicability of the algorithm rather limited, and even then, with marginal gain. Put crisply, it is desirable that low quadratic variation should lead an algorithm to choose an aggressive learning rate, and negative entropy fails to maintain a ‘stability’ property (in the sense of Lemma 14), key in obtaining OMD regret bounds, in such situations. The log-barrier regularizer, used by Wei and Luo (2018) for bandit feedback, certainly guarantees this; however, using the log-barrier blindly translates to a dependence on the number of arms .
These challenges place label efficient prediction with slowly varying losses in a unique position, as one requires enough curvature to ensure stability, yet not let this added curvature significantly hinder exploration. Our solution is to use a hybrid regularizer, that is, a weighted sum of the negative entropic regularizer and the log-barrier regularizer:
This regularizer has been of recent interest due to the work of Bubeck et al. (2019), and Bubeck et al. (2017), but the weights chosen for the two components are highly application-specific and tend to reflect the nature of the problem. As reported above, we only require the log-barrier to guarantee stability, and therefore associate a small (roughly ) weight to it and a dominant mass of to negative entropy. This fact is revealed in the analysis where we use the log-barrier component solely to satisfy Lemmas 13 and 14, following which it is essentially dispensed with. The additional factor part of the log-barrier weight is carefully chosen to exactly cancel the in the leading term generated by the log-barrier component, and consequently, not have a dependence on in the final regret bound.
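As a rough illustration of the hybrid regularizer discussed above, the sketch below evaluates a weighted sum of the negative entropy and the log-barrier, together with the diagonal of its Hessian, which is what provides the extra curvature needed for stability. The two weights are left as free arguments because the specific values used in the paper are not reproduced here; `entropy_weight` and `barrier_weight` are hypothetical names.

```python
import numpy as np

def hybrid_regularizer(x, entropy_weight, barrier_weight):
    """entropy_weight * sum_i x_i log x_i  +  barrier_weight * sum_i (-log x_i)."""
    x = np.asarray(x, dtype=float)  # assumed strictly positive
    neg_entropy = np.sum(x * np.log(x))
    log_barrier = -np.sum(np.log(x))
    return float(entropy_weight * neg_entropy + barrier_weight * log_barrier)

def hybrid_hessian_diag(x, entropy_weight, barrier_weight):
    """Diagonal of the Hessian: entropy contributes 1/x_i, log-barrier 1/x_i^2."""
    x = np.asarray(x, dtype=float)
    return entropy_weight / x + barrier_weight / x ** 2
```

The Hessian makes the trade-off visible: for small coordinates the log-barrier term grows like 1/x² and dominates even with a small weight, while for moderate coordinates the entropy term determines the geometry.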
When considering quadratic variation as a measure of adaptivity, a natural message to pass is the mean of the previous loss history, that is . However, the constraint on information prohibits us from having the full history, and we therefore have to settle for some estimate of the mean. Reservoir sampling, first used in Hazan and Kale (2011), solves this very problem. Specifically, by allocating roughly rounds for reservoir sampling (where we choose to be ), reservoir sampling gives us estimates such that , and . It does so by maintaining a carefully constructed reservoir of size , the elements from which are then averaged to output the estimate of the mean. Our message at any time is the average of the vectors contained in the reservoir .
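The idea of maintaining a fixed-size reservoir whose average estimates the running mean can be sketched as follows. This is the textbook reservoir sampling scheme (each item seen so far is retained with equal probability), shown for scalar values; the carefully constructed reservoir of Hazan and Kale (2011), and its allocation of dedicated sampling rounds, may differ in its details. The class name `Reservoir` and its methods are illustrative.

```python
import random

class Reservoir:
    """Uniform reservoir sample of fixed size over a stream of values."""

    def __init__(self, size, seed=None):
        self.size = size
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def offer(self, value):
        """Present one stream element to the reservoir."""
        self.seen += 1
        if len(self.items) < self.size:
            # Fill phase: keep everything until the reservoir is full.
            self.items.append(value)
        else:
            # Replace a stored item with probability size / seen.
            j = self.rng.randrange(self.seen)
            if j < self.size:
                self.items[j] = value

    def mean(self):
        """Average of the stored items, used as the estimate of the stream mean."""
        if not self.items:
            return None
        return sum(self.items) / len(self.items)
```

For loss vectors, one such reservoir per coordinate (or a reservoir of vectors) would play the role of the message described above.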
2.1 Main Algorithm
Algorithm 1 builds upon the ideas presented above and, as stated, is specifically for the label efficient prediction problem discussed thus far. The algorithms required for the extensions we provide in section 4 are based upon algorithm 1, although with a few minor differences. We specify those differences as and when required. Also, in the interest of brevity, we have excluded the explicit mentioning of the reservoir sampling steps. Before we proceed, we would like to cleanly state our choice of messages, loss estimates, and second order corrections; this is done in Table 2. Our messages for all the sections will be . Note that throughout the paper, the random variable signifies that we ask for feedback at time , and is otherwise. Additionally, note that we consider not exceeding the budget of in expectation; however, there is a standard reduction to get a high probability guarantee, which can be found in Cesa-Bianchi and Lugosi (2006).
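The role of the query indicator in building importance-weighted loss estimates can be illustrated with a minimal sketch. It assumes the simplest inverse propensity form: the learner queries independently with a fixed probability `eps` and, when it does, divides the observed loss vector by `eps`. The estimates actually used in the paper are the ones listed in its Table 2 and may additionally involve the messages; the function names here are hypothetical.

```python
import numpy as np

def label_efficient_estimate(loss, queried, eps):
    """Inverse propensity estimate: loss / eps if the label was queried, else 0."""
    loss = np.asarray(loss, dtype=float)
    return loss / eps if queried else np.zeros_like(loss)

def expected_estimate(loss, eps):
    """Exact expectation over the query indicator, which is 1 with probability eps."""
    return (eps * label_efficient_estimate(loss, True, eps)
            + (1.0 - eps) * label_efficient_estimate(loss, False, eps))
```

Because the query probability is fixed, the denominator never becomes arbitrarily small, in contrast with bandit estimates that divide by the probability of the played arm.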
3 Results and Analysis
We now give a general regret result for the OMD updates (1) and (2). It spells out the condition we must maintain to ultimately enable best arm dependent bounds while also demonstrating the price of limited information on regret, which is the additional factor. The proofs for all results in this section appear in the supplementary material.
Note that when is employed in the updates (1)-(2), i.e., no second order corrections, the first term in (3) can directly be handled using Hölder’s inequality (in some norm where is strongly convex). Doing so allows us to cancel the unwanted term using the term in (which follows by strong convexity) while retaining the crucial variance term. However, with general second order corrections (), the key variance term is as it corresponds to the best arm’s second moment under a suitably chosen and the responsibility of cancelling the entire first term of (3) now falls upon . Under limited information, negative entropy is unable to maintain this and we therefore have to incorporate the log barrier function (also see Lemma 1 in Wei and Luo (2018)). We now state our main result for adaptive label efficient prediction which bounds the regret of Algorithm 1.
For , , and where the sequence of messages are generated using the reservoir sampling scheme, the expected regret of algorithm 1 satisfies the following:
Furthermore, if , then with an optimal choice of .
Consider a concrete example of a game played for time , where we anticipate and . In this scenario, if we were to run the standard label efficient prediction algorithm as given in Cesa-Bianchi et al. (2005), we would get a regret bound of . An FTRL strategy based on negative entropy (as done in Steinhardt and Liang (2014) for prediction with experts) would be inapplicable in this setting due to the constraint we highlight in section 2. Algorithm 1, however, would incur regret, a marked improvement. Also, note that because of the full vector feedback, it is not required to allocate any rounds exclusively for reservoir sampling. This fact is reflected in not having to incur any additive penalty for reservoir sampling.
Theorem 2 is slightly restricted in scope, due to the lower bound required on , in its ability to attain the optimal regret scaling with quadratic variation. We now proceed to discuss what can be said without any constraint on . Specifically, we will provide an algorithm obtaining regret under all scenarios, the trade-off however being that we will be penalized by instead of . In settings where the condition does not hold and incurring regret in terms of is not unfavourable (as an extreme example, consider constant variation on all arms, with very limited feedback) the strategy below will certainly be of use. The algorithm, again based on OMD, foregoes second order corrections and has updates defined by:
Without second order corrections, the term can be folded into the regularizer and the updates reduce to the ones studied in Rakhlin and Sridharan (2012). For updates (5) and (6), we have the following analogue of Lemma 1, and then consequently, the analogue of Theorem 2. We include these here in the interest of completeness, but equivalent statements can be found in Rakhlin and Sridharan (2012).
For , , and , where the sequence of messages are generated using the reservoir sampling scheme, Algorithm 1 with yields:
Optimally tuning yields a bound.
Understanding more deeply how the constraint of Theorem 2 can be sidestepped to yield a universal algorithm dependent on remains a direction of future interest.
Note that we have assumed knowledge of , and when optimising for the fixed step size in the above discussion. This is often not possible and we now briefly discuss the extent to which we can obtain parameter-free algorithms. In Theorem 5 we claim that we can choose adaptively for the dependent bound we present in Theorem 4 and discuss this in Appendix B. (Note that, similarly to Hazan and Kale (2011), we still assume knowledge of , but this can be circumvented using standard tricks.) It remains open whether a dependent bound (or in general, any non-monotone dependent bound) can be made parameter free for even the standard prediction with expert advice problem. The challenge is essentially that our primary tool to sidestep prior knowledge of a parameter, the doubling trick, is inapplicable for non-monotone quantities.
Even freeing algorithms from prior knowledge of non-decreasing arm dependent quantities, such as , remains open for limited information setups (i.e. anything outside prediction with expert advice) due to the lack of a clear auxiliary term one can observe. In Algorithm 2, we proceed in epochs (or rounds) such that remains fixed per epoch. Denote by the value of in epoch . We will write for the first time instance in epoch .
4 Adapting to Slowly Varying Losses in Other Information-Constrained Games
We will now investigate exploiting the regularity of losses in a variety of other settings with implicit/explicit information constraints. We will first focus on bandit feedback, following which we will briefly discuss partial monitoring. The proofs for this section can be found in the supplementary material.
4.1 Label Efficient Bandits
The change here is in the feedback information the learner receives when asking for information. Instead of receiving the full loss vector, the learner now only receives the loss of the played arm , i.e. the th coordinate of . We will continue to use the same update rules (1) and (2) here. What will change most importantly is the regularizer which will now solely be the log barrier regularizer . Note that the coefficient of log barrier is also instead of the earlier . The loss estimates and second order corrections will also change and these are all mentioned in Table 2. We will now state the main theorem for label efficient bandits. Most of the analysis is similar to Theorem 2, but we do highlight the differences in Appendix C.1 in the supplementary material.
For , , and where the sequence of messages are given by reservoir sampling, the regret of algorithm 1 modified for label efficient bandits satisfies:
Note that since we are in the bandit feedback setting, we now reserve certain rounds solely for reservoir sampling. This is reflected in the additive term in regret. There are now rounds allotted to each of the arms, hence the term. There will also be a few minor changes in the algorithm primarily corresponding to the appropriate execution of reservoir sampling for bandit feedback.
4.2 Partial Monitoring
We will now discuss adaptivity in partial monitoring games. A partial monitoring game is defined by a pair and of matrices. Both matrices are visible to the learner and the adversary. At each time , the learner selects a row (or arm, action) and the opponent chooses a column . The learner then incurs a loss of and observes feedback . (We are considering oblivious adversarial opponents as before and further take entries of to be in . The assumption on the entries is not major since the learner can always appropriately encode the original entries by numbers.) When clear from context, we will denote by the loss of arm at time and by the feedback of arm at time . The expected regret here is:
4.2.1 Revealing Action Partial Monitoring
First consider the class of partial monitoring games with a revealing action; that is, suppose has a row with distinct elements. It is clear that if the learner plays this row, they can receive full information regarding which column the adversary has chosen. The cost of playing this row may well determine which class this game falls into (see for example the spam game discussed in Lattimore and Szepesvári (2019)), but in general, the minimax regret of these games scales as and these games therefore fall in the hard class of games. Revealing action games and label efficient prediction differ in the way they charge the learner for information. For label efficient prediction, we have seen that there is a fixed number of times (budget) one can obtain information, but there is no additional cost of doing so. In revealing action games, however, there is a loss associated with each time the learner asks for information. We will now show a reduction from this class of games to the standard label efficient prediction we discussed in sections 2 and 3.
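The defining property of a revealing action, a feedback row whose entries are all distinct so that the observed symbol identifies the adversary's column, is easy to check mechanically. The sketch below assumes the feedback matrix is given as a list of rows of hashable symbols; `revealing_actions` is an illustrative helper, not part of the paper.

```python
def revealing_actions(feedback_matrix):
    """Return the indices of rows whose feedback symbols are pairwise distinct."""
    return [
        i
        for i, row in enumerate(feedback_matrix)
        if len(set(row)) == len(row)  # distinct entries: the column is identifiable
    ]
```

For instance, with feedback rows `["a", "a"]` and `["b", "c"]`, only the second row is revealing: playing it tells the learner exactly which column was chosen, while the first row carries no information.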
Let the cost of playing the revealing action be where is the revealing action row of . Suppose is the probability with which we play the revealing action at each round. here corresponds to the from earlier sections, however is now a free parameter. (Note that the update rules (1) and (2) will now also have in place of .) We will still run reservoir sampling in the background as before to obtain the optimistic messages . Now, in this light, the following theorem can be seen to follow from Theorem 2.
For , , and where the sequence of messages are generated using reservoir sampling, the expected regret of algorithm 1 modified for revealing action partial monitoring games with loss entries in satisfies the following:
Optimising the parameters and yields a bound of .
Note that now, we will again have to allocate rounds specifically for reservoir sampling as was the case with bandits, hence the additive term. The added corresponds to the cost paid for playing the revealing action.
4.2.2 Hard Partial Monitoring Games
We now turn to the hard class of partial monitoring games. As mentioned in Piccolboni and Schindelhauer (2008) and Cesa-Bianchi and Lugosi (2006), we will assume that there exists a matrix such that . This is not an unreasonable assumption, as if this does not hold for the given and , one can suitably modify (see Piccolboni and Schindelhauer (2008)) and to ensure , and if this condition continues to fail after appropriate modifications, Piccolboni and Schindelhauer (2008) show that sublinear regret is not possible for the original . Observe that will allow us to write . Therefore:
is now an unbiased estimate of . is still the optimistic messages where corresponds to an estimate of the average loss incurred by arm till time . These will still be obtained using reservoir sampling and we will maintain a separate reservoir for each arm . Note that since and the matrices , and are all visible to the learner, playing action at time for example will allow the learner to observe the th component of the loss for each action . Therefore, by maintaining an estimate (reservoir) for each component, we will be able to maintain an estimate for each arm.
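The unbiasedness claimed above can be checked numerically with a small sketch. It assumes the factorization used by Cesa-Bianchi and Lugosi (2006), namely that the loss matrix equals a known transformation matrix times the feedback matrix, and the corresponding estimate: the column of the transformation matrix for the played action, times the observed feedback, divided by the probability of the played action. The names `transform`, `feedback` and `probs` are illustrative, and numeric feedback entries are assumed.

```python
import numpy as np

def pm_loss_estimate(transform, feedback, played, outcome, probs):
    """Estimate of every action's loss after observing feedback[played, outcome]."""
    transform = np.asarray(transform, dtype=float)
    feedback = np.asarray(feedback, dtype=float)
    # Only the single feedback entry of the played action is used.
    return transform[:, played] * feedback[played, outcome] / probs[played]

def expected_pm_estimate(transform, feedback, outcome, probs):
    """Exact expectation of the estimate over the randomly played action."""
    n_actions = np.asarray(feedback).shape[0]
    return sum(
        probs[j] * pm_loss_estimate(transform, feedback, j, outcome, probs)
        for j in range(n_actions)
    )
```

Averaging over the played action cancels the division by its probability, leaving the corresponding column of the loss matrix, which is why every action must be played with positive probability.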
Now, for these games we will use optimistic OMD without second order corrections (Rakhlin and Sridharan, 2012; Chiang et al., 2012). The update rules are the same as equations (5) and (6) without the term. Additionally, the arm we play will be sampled from where . The forced exploration is necessary to allow a minimum mass on all arms. Note that the structure defined by says that we potentially have to play all arms to maintain unbiased estimates of any arm. This forced exploration is unavoidable (see Cesa-Bianchi and Lugosi (2006)).
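Forced exploration of the kind described above is commonly implemented by mixing the OMD distribution with the uniform distribution. The sketch below assumes that standard form, with a hypothetical exploration rate `gamma`; the exact mixture used in the paper is not reproduced here.

```python
import numpy as np

def mix_with_uniform(weights, gamma):
    """Return (1 - gamma) * weights + gamma * uniform, so each arm gets mass >= gamma / K."""
    weights = np.asarray(weights, dtype=float)
    return (1.0 - gamma) * weights + gamma / len(weights)
```

The lower bound of `gamma / K` on every probability keeps the importance weights in the loss estimates bounded, at the price of playing possibly poor arms a `gamma` fraction of the time.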
Note here the strong dependence on which is an outcome of each being dependent on potentially all () other actions.
5 Lower Bounds
We now prove explicit quadratic variation-based lower bounds for (standard) label efficient prediction and label efficient bandits. By capturing both the constraint on information as well as the quadratic variation of the loss sequence, our lower bounds generalize and improve upon existing lower bounds. We extend the lower bounds for label efficient prediction to further incorporate the quadratic variation of the loss sequence and enhance the quadratic variation dependent lower bounds for multi-armed bandits to also include the constraint on information by bringing in the number of labels the learner can observe ().
Our bounds will be proven in a 2-step manner similar to that in Gerchinovitz and Lattimore (2016). The main feature of step 1 (the Lemma step) is that of centering the Bernoulli random variables around a parameter instead of , which leads the regret bound to involve the term corresponding to the variance of the Bernoulli distribution. Step 2 (the Theorem step) builds upon step 1 and shows the existence of a loss sequence belonging to an -variation ball (defined below) which also incurs regret of the same order. Recall the quadratic variation for a given loss sequence: . Now, for define an -variation ball as: . All of the proofs for this section have been postponed to Appendix D in the supplementary material.
Theorems 10 and 12, after incorporating , give us lower bounds of and respectively. Our corresponding upper bounds are and . (We upper bound all of our dependent upper bounds by so as to consistently compare with the lower bounds. Note that and are in general incomparable and all that can be said is that .) Comparing the two tells us that our strategies are optimal in their dependence on and on the constraint in information indicated by . There is however a gap of . This gap was mentioned in Gerchinovitz and Lattimore (2016) for the specific case of the multi-armed bandit problem, and was closed recently in Bubeck et al. (2017). Barring the easy to see lower bound for prediction with expert advice (which is also what Theorem 10 translates to for ), we are unaware of other fundamental based lower bounds for prediction with expert advice. The upper bounds for prediction with expert advice however are of (Hazan and Kale, 2010; Steinhardt and Liang, 2014, etc.), and this again suggests the gap. Closing this for prediction with expert advice, label efficient prediction and for label efficient bandits remains open, as does the question of finding dependent lower bounds.
Label Efficient Prediction (Full Information)
As mentioned previously, the main difference here from the standard label efficient prediction lower bound proof (Cesa-Bianchi et al., 2005) is that of centering the Bernoulli random variables around a parameter which is responsible for ultimately bringing out the quadratic variation of the sequence. Our main statements for label efficient prediction are as follows.
Let , , . Then, for any randomized strategy for the label efficient prediction problem, there exists a loss sequence under which for where expectation is taken with respect to the internal randomization available to the algorithm and the random loss sequence.
Let , and . Then, for any randomized strategy for the label efficient prediction problem, where expectation is taken with respect to the internal randomization available to the algorithm.
Label Efficient Bandits
The main difference here from standard bandit proofs is that now, the total number of revealed labels (each label is now a single loss vector entry) cannot exceed . Hence, the term which appears in the analysis is upper bounded by (where denotes the pulls of arm up till time ).
Let , , . Then, for any randomized strategy for the label efficient bandit problem, there exists a loss sequence under which where expectation is taken with respect to the internal randomization available to the algorithm and the random loss sequence.
Let , and with . Then, for any randomized strategy for the label efficient bandit problem, where expectation is taken with respect to the internal randomization available to the algorithm.
We consider problems lying at the intersection of two relevant questions in online learning: how does one adapt to slowly varying data, and what best can be done with a constraint on information. As far as we know, the proposed algorithms are the first to jointly address both of these questions. There remain plenty of open problems in the area. Seeing to what extent universal dependent algorithms can be obtained in starved information settings is a direction of future interest, as is closing the gap in highlighted in Section 5. Moreover, extending the notion of adaptivity to partial monitoring games to consider locally observable games and, even more interestingly, locally observable sub-games within hard games also remains open. Higher order lower bounds for partial monitoring games have also not been studied and one wonders to what extent adaptivity can help in partial monitoring.
- Allenberg et al. (2006) Chamy Allenberg, Peter Auer, László Györfi, and György Ottucsák. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In International Conference on Algorithmic Learning Theory, pages 229–243. Springer, 2006.
- Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR.org, 2017.
- Bartók et al. (2011) Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Minimax regret of finite partial-monitoring games in stochastic environments. In Proceedings of the 24th Annual Conference on Learning Theory, pages 133–154, 2011.
- Bartók et al. (2014) Gábor Bartók, Dean P. Foster, Dávid Pál, Alexander Rakhlin, and Csaba Szepesvári. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014.
- Beck and Teboulle (2003) Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, 2003.
- Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. 2013.
- Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Bubeck et al. (2017) Sébastien Bubeck, Michael B. Cohen, and Yuanzhi Li. Sparsity, variance and curvature in multi-armed bandits. In ALT, 2017.
- Bubeck et al. (2019) Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei. Improved path-length regret bounds for bandits. CoRR, abs/1901.10604, 2019.
- Cesa-Bianchi and Lugosi (2006) Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006. ISBN 0521841089.
- Cesa-Bianchi et al. (2005) Nicolo Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, 2005.
- Chiang et al. (2012) Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In COLT, volume 23 of JMLR Proceedings, pages 6.1–6.20. JMLR.org, 2012.
- Chow and Teicher (1980) Yuan-Shih Chow and Henry Teicher. Probability theory: Independence, interchangeability, martingales. 1980.
- Gerchinovitz and Lattimore (2016) Sébastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. In NIPS, pages 1190–1198, 2016.
- Hazan (2016) Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2:157–325, 01 2016. doi: 10.1561/2400000013.
- Hazan and Kale (2010) Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: regret bounded by variation in costs. Machine Learning, 80(2):165–188, Sep 2010.
- Hazan and Kale (2011) Elad Hazan and Satyen Kale. Better algorithms for benign bandits. J. Mach. Learn. Res., 12:1287–1311, July 2011. ISSN 1532-4435.
- Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Lattimore and Szepesvári (2019) Tor Lattimore and Csaba Szepesvári. Cleaning up the neighborhood: A full classification for adversarial partial monitoring. In ALT, volume 98 of Proceedings of Machine Learning Research, pages 529–556. PMLR, 2019.
- Piccolboni and Schindelhauer (2008) Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss (extended abstract). pages 208–223, 01 2008. doi: 10.1007/3-540-44581-1_14.
- Rakhlin and Sridharan (2012) Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. arXiv preprint arXiv:1208.3728, 2012.
- Steinhardt and Liang (2014) Jacob Steinhardt and Percy S. Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In ICML, 2014.
- Wei and Luo (2018) Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, COLT 2018, 2018.
Appendix A Label Efficient Prediction Main Proofs
where we chose when applying it to update rule (1). Now observe that:
Combining the above inequalities with equation (3) gives us
where (by non-negativity of Bregman divergence). ∎
For some radius , define the ellipsoid . If , , then, for all , we have . Additionally, for all .
Proof of Lemma 13. As , we can say that which further implies . Hence, we have . Now, since , the first part of the lemma follows. Further observe . ∎
Because of the convexity of , to prove our claim, it is sufficient to show that for all points on the boundary of the ellipsoid. By Taylor’s theorem, we know that on the line segment between and such that: