The online learning paradigm [LW94, CBL06] has become a key tool for solving a wide spectrum of problems such as developing strategies for players in large multiplayer games [BEDL06, BHLR08, Rou15, LST16, FLL16], designing online marketplaces and auctions [BH05, CBGM13, RW16], portfolio investment [Cov91, FS97, HAK07], and online routing [AK04, KV05]. In each of these applications, the learner has to repeatedly select an action on every round. Different actions have different costs or losses associated with them on every round. The goal of the learner is to minimize her cumulative loss, and her performance is evaluated by the notion of “regret”, defined as the difference between the cumulative loss of the learner and the cumulative loss of the benchmark.
The term “small-loss regret bound” is often used to refer to bounds on regret that depend (or mostly depend) on $L^\star$, the cumulative loss of the benchmark, rather than on the total number of rounds played $T$, often referred to as the time horizon. For instance, for many classical online learning problems, one can in fact show that regret can be bounded by $\widetilde{O}(\sqrt{L^\star})$ rather than $\widetilde{O}(\sqrt{T})$. However, these algorithms rely on the full information model: they assume that on every round, the learner receives as feedback the losses of all possible actions (not only of the selected action). In such full information settings, it is well understood when small-loss bounds are achievable and how to design learning algorithms that attain them. However, in most applications, full information about the losses of all actions is not available. Unlike the full information case, the problem of obtaining small-loss regret bounds for partial information settings is poorly understood. Even in the classical multi-armed bandit problem, small-loss bounds are only known in expectation, either against so-called oblivious adversaries or when comparing against the lowest expected cost of an arm (and not the actual lowest cost), a notion referred to as pseudo-regret.
The goal of this paper is to develop robust techniques for extending the small-loss guarantees to a broad range of partial feedback settings where the learner only observes the losses of selected actions and some neighboring actions. In the basic online learning model, at each round $t$, the decision maker or learner chooses one action from a set of $d$ actions, typically referred to as arms. Simultaneously an adversary picks a loss vector $\ell^t \in [0,1]^d$ indicating the losses for the arms. The learner suffers the loss of her chosen arm and observes some feedback. The variants of online learning differ by the nature of the feedback received. The two most prominent such variants are the full information setting, where the feedback is the whole loss vector, and the bandit setting, where only the loss of the selected arm is observed. Bandits and full information represent two extremes. In most realistic applications, a learner choosing an action $i$ learns not only the loss associated with her chosen action $i$, but also some partial information about the losses of some other actions. A simple and elegant model of this partial information is the graph-based feedback model of [MS11, ACBG17], where at every round $t$ there is a (possibly time-varying) undirected graph $G^t$ representing the information structure, whose nodes are the possible actions. If the learner selects an action $i$ and incurs the loss $\ell_i^t$, she observes the losses of all the nodes connected to node $i$ by an edge in $G^t$. Our main result in Section 3 is a general technique that allows us to use any full information learning algorithm as a black-box, and design a learning algorithm whose regret can be bounded with high probability by a small-loss bound that depends on $\alpha$, where $\alpha$ is the maximum independence number of the feedback graphs. This graph-based information feedback model is a very general setting that can encode full information, bandits, as well as a number of other applications.
1.1 Our contribution
We develop a unified, black-box technique to achieve small-loss regret guarantees with high probability in various partial information feedback models. We obtain the following results.
In Section 3, we provide a generic black box reduction from any small-loss full information algorithm. When used with known algorithms it achieves actual regret guarantees of that hold with high probability for any of pure bandits, semi-bandits, contextual bandits, or feedback graphs (with dependence on the information structure in the as for the first three, and for feedback graphs). There are three novel features of this result. First, unlike most previous work in partial information that is heavily algorithm-specific, our technique is black-box in the sense that it takes as input a small-loss full information algorithm and, via a small modification, makes it work under partial information. Second, prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (without dependence on the number of actions, i.e., taking advantage of the increased information feedback), while we provide a high probability small-loss guarantee. Last, our guarantees are not for pseudo-regret but actual regret guarantees that hold with high probability.
In Section 4, we show various applications. The black-box nature of our reduction allows us to use the full information learning algorithms best suited for each application. We obtain small-loss guarantees for semi-bandits [KV05] (including routing in networks), for contextual bandits [ACBFS03, LZ07] (even with an infinite comparator class), as well as learning with slowly changing (shifting) comparators [HW98] as needed in games with dynamic population [LST16, FLL16].
In Section 5, we focus on the special case of bandits, semi-bandits, graph feedback from fixed graphs, and shifting comparators. In each setting we take advantage of properties of a learning algorithm best suited in the application to alleviate the inefficiencies resulting from the black-box nature of our general reduction. For bandits and semi-bandits, we provide optimal small-loss actual regret high-probability guarantees of . Previous work for bandits and semi-bandits offered analogous bounds only for pseudo-regret and only in expectation. This answers an open question of [Neu15b, Neu15a]. In the case of fixed feedback graphs, we achieve optimal dependence on loss, at the expense of the bound depending on clique-partition number of the graph, rather than the independence number.
Our main technique is a dual-thresholding scheme that temporarily freezes low-performing actions, i.e. does not play them at the current round. Traditional partial information guarantees are based on creating an unbiased estimator for the loss of each arm and then running a full information algorithm on the estimated losses. The most prominent such unbiased estimator, called importance sampling, is equal to the actual loss divided by the probability with which the action is played. This division can make the estimated losses unbounded in the absence of a lower bound on the probability of being played. Algorithms like EXP3 [ACBFS03] for the bandit setting or Exp3-DOM [ACBG17] for the graph-based feedback setting mix in a small amount of uniform noise, which ensures that the range of losses is bounded. Adding such uniform noise works well for learners maximizing utility, but can be very damaging when minimizing losses. In the case of utilities, playing low-performing arms with a small probability $\epsilon$ can lose at most an $\epsilon$ fraction of the utility. In contrast, when the best arm has small loss, the losses incurred due to the noise can dominate. This approach can only return uniform bounds, with regret growing with the time horizon $T$, since, even in the case that there is a perfect arm that has zero loss, the algorithm keeps playing low-performing arms. Some specialized algorithms do achieve small-loss bounds for bandits, but these techniques extend neither to graph feedback nor to high probability guarantees (see also the discussion below about related work).
Instead of mixing in noise, we take advantage of the freezing idea, originally introduced by Allenberg et al. [AAGO06] with a single threshold, which offers a new way to adapt the multiplicative weights algorithm to the bandit setting. The resulting estimator is negatively biased for the arms that are frozen but is always unbiased for the selected arm. Using these expectations, the regret bound of the full information algorithm can be used to bound the expected regret compared to the expected loss of any fixed arm, achieving low pseudo-regret in expectation. To achieve good bounds, we need to guarantee that the total probability frozen is limited. By freezing arms with probability less than $\epsilon/d$, where $d$ is the number of arms, the total probability that is frozen at each round is at most $\epsilon$ and therefore contributes a regret term of $\epsilon$ times the loss of the algorithm, which gives a dependence on $d$ in the regret bound. This was analyzed in the context of multiplicative weights in [AAGO06].
Our main technical contribution is to greatly expand the power of this freezing technique. We show how to apply it in a black-box manner with any full information learning algorithm and extend it to graph-based feedback. To deal with the graph-based feedback setting, we suggest a novel and technically more challenging dual-threshold freezing scheme. The natural way to apply importance sampling in the graph-based feedback is by dividing the actual loss by the probability of being observed, i.e. the sum of the probabilities that the action and its neighbors are played. An initial approach is to freeze an action if its probability of being observed is below some threshold $\gamma$. We show that the total probability frozen by this step is bounded by $\alpha\gamma$, where $\alpha$ is the independence number of the feedback graph, i.e. the size of its maximum independent set. To see why, consider a maximal independent set $S$ of the frozen actions and note that all frozen actions are observed by some node in $S$. This observation seems to imply that we can replace the dependence on the number of arms $d$ by a dependence on $\alpha$. However, there are externalities among actions, as freezing one action may affect the probability of another being observed. As a result, the latter may need to be frozen as well to ensure that all active arms are observed with probability at least $\gamma$ (and therefore obtain our desired upper bound on the range of the estimated losses). This causes a cascade of freezing, possibly freezing a large amount of additional probability.
To limit this cascade effect, we develop a dual-threshold freezing technique: we initially freeze arms that are observed with probability less than $\gamma$, and subsequently use a lower threshold $\gamma' = \gamma/3$ and only freeze arms that are observed with probability less than $\gamma'$. This technique allows us to bound the total probability of arms that are frozen subsequently by the total probability of arms that are frozen initially. We prove this via an elegant combinatorial charging argument of Claim 3.
Last, to go beyond pseudo-regret and guarantee actual regret bounds with high probability, it does not suffice to have the estimator be negatively biased; we also need to obtain a handle on the variance. We prove that freezing also provides such a lever, leading to a high-probability regret guarantee that holds in a black-box manner. Interestingly, this freezing technique, via a small modification, enables the same guarantee for semi-bandits, where the independent set is replaced by the number of elements (edges).
In order to obtain the optimal high-probability guarantee for bandits and semi-bandits, we need to combine our black box analysis with taking advantage of features of concrete full information learning algorithms. The black-box nature of the previous analysis is extremely useful in demonstrating where additional features are needed. Combining our analysis with the implicit exploration technique [KNVM14] similarly as in the analysis of Neu [Neu15a], we develop an algorithm based on multiplicative weights, which we term GREEN-IX, which achieves the optimal high-probability small-loss bound for the pure bandit setting. Using an alternative technique of Neu [Neu15b]: truncation in the follow the perturbed leader algorithm, we also obtain the corresponding result for semi-bandits.
1.2 Related work
Online learning with partial information dates back to the seminal work of Lai and Robbins [LR85]. They consider a stochastic version, where losses come from fixed distributions. The case where the losses are selected adversarially, i.e. they do not come from a distribution and may be adaptive to the algorithm’s choices, which we examine in this paper, was first studied by Auer et al. [ACBFS03], who provided the EXP3 algorithm for pure bandits and the EXP4 algorithm for learning with expert advice (a more general model than contextual bandits considered in [LZ07]). They focus on uniform regret bounds, i.e. bounds that grow as a function of the time horizon $T$, and bound mostly the expected performance, but such guarantees can also be derived with high probability [ACBFS03, AB10, BLL11]. Data-dependent guarantees are easily derived from the above algorithms for the case of maximizing some reward, as even getting zero reward with probability $\epsilon$ only causes an $\epsilon$ fraction of loss in utility. In contrast, incurring high cost with a small probability can dominate the loss of the algorithm if the best arm has small loss. In this paper we develop data-dependent guarantees for partial information algorithms for the case of losses. There are a few specialized algorithms that achieve such small-loss guarantees for the case of bandits for pseudo-regret, e.g. by ensuring that the estimated losses of all arms remain close [AAGO06, Neu15b] or using a stronger regularizer [RS13, FLL16], but none of these methods offers high probability small-loss guarantees even for the bandit setting, nor do they extend to graph-based feedback. Our technique allows us to develop small-loss bounds on actual regret with high probability.
The graph-based partial information that we examine in this paper was introduced by Mannor and Shamir [MS11], who provided ELP, a linear programming based algorithm achieving $\widetilde{O}(\sqrt{\alpha T})$ regret for undirected graphs, where $\alpha$ is the independence number of the graph and $T$ is the time horizon. Alon et al. [ACBGM13, ACBG17] provided variants of Exp3 (Exp3-SET) that recovered the previous bound via what they call explicit exploration. Following this work, there have been multiple results on this setting, e.g. [ACBDK15, CHK16, KNV16, TDD17], but prior to our work, there was no small-loss guarantee for the feedback graph setting that could exploit the graph structure. To obtain a regret bound depending on the graph structure, the above techniques upper bound the losses of the arms by the maximum loss, which results in a dependence on the time horizon $T$ instead of the loss of the best arm $L^\star$. Addressing this, we achieve regret that scales with an appropriate problem dimension, the size of the maximum independent set $\alpha$, instead of ignoring the extra information and only depending on the number of arms, as do all small-loss results of prior work.
Biased estimators have been used prior to our work for achieving better regret guarantees. The freezing technique of [AAGO06] can be thought of as the first use of biased estimators. Their GREEN algorithm uses freezing in the context of the multiplicative weights algorithm for the case of pure bandits. Freezing keeps the range of estimated losses bounded and, when used with the multiplicative weights algorithm, also keeps the cumulative estimated losses very close, which ensures that one does not lose much in the application of the full information algorithm. Using these facts, Allenberg et al. [AAGO06] achieved small-loss guarantees for pseudo-regret in the classical multi-armed bandit setting. An approach very close to freezing is the implicit exploration of Kocák et al. [KNVM14], which adds a term in the denominator of the estimator, making the estimator biased even for the selected arms. The FPL-TrIX algorithm of Neu [Neu15b] is based on the Follow the Perturbed Leader algorithm, using implicit exploration together with truncating the perturbations to guarantee that the estimated losses of all actions are close to each other, and the geometric resampling technique of Neu and Bartók [NB13] to obtain these estimated losses. His analysis provides small-loss regret bounds for pseudo-regret, but does not extend to high-probability guarantees. The EXP3-IX algorithm of Kocák et al. [KNVM14] combines implicit exploration with multiplicative weights to obtain, via the analysis of Neu [Neu15a], high-probability uniform bounds. Focusing on uniform regret bounds, exploration and truncation were presented as strictly superior to freezing. In this paper, we show an important benefit of the freezing technique: it can be extended to handle feedback graphs (via our dual-thresholding). We also combine freezing with multiplicative weights to develop an algorithm we term GREEN-IX, which achieves optimal high-probability small-loss bounds for the pure bandit setting.
Finally, combining freezing with the truncation idea, we obtain the corresponding result for semi-bandits; in contrast, the geometric resampling analysis does not seem to extend to high probability since it does not provide a handle on the variance of the estimated loss.
In this section we describe the basic online learning protocol and the partial information feedback model we consider in this paper. In the online learning protocol, in each round $t$, the learner selects a distribution $w^t$ over $d$ possible actions, i.e. $w_i^t$ denotes the probability with which action $i$ is selected on round $t$. The adversary then picks losses $\ell^t \in [0,1]^d$, where $\ell_i^t$ denotes the loss of action $i$ on round $t$. The learner then draws action $I(t)$ from the distribution $w^t$ and suffers the corresponding loss $\ell_{I(t)}^t$ for that round. At the end of round $t$, the learner receives feedback about the losses of the selected action and some neighboring actions. The feedback received by the learner on each round is based on a feedback graph model described below.
2.1 Feedback graph model
We assume that the learner receives partial information based on an undirected feedback graph $G^t$ that could possibly vary in every round. The learner observes the loss of the selected arm and, in addition, she also observes the losses of all arms connected to the selected arm in the feedback graph. More formally, she observes the loss $\ell_j^t$ for all the arms $j \in N_{I(t)}^t$, where $I(t)$ is the selected arm and $N_i^t$ denotes the set containing arm $i$ and all neighbors of $i$ in $G^t$ at round $t$. The full information feedback setting and the bandit feedback setting are special cases of this model, where the graph $G^t$ is the clique and the empty graph respectively for all rounds $t$.
We allow the feedback graph $G^t$ to change each round $t$, but assume that the graph $G^t$ is known to the player before selecting her distribution $w^t$. This model also includes the contextual bandits problem of [ACBFS03, LZ07] as a special case, where each round the learner is also presented with an additional input $x^t$, the context. In this contextual setting, the learner is offered policies, each suggesting an action depending on the context, and each round the learner can decide which policy’s recommendation to follow. To model this with our evolving feedback graph, we use the policies as nodes, and connect two policies with an edge in $G^t$ if they recommend the same action in the context $x^t$ of round $t$.
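The policy-graph construction above can be sketched in a few lines. This is only an illustration: the function name `policy_feedback_graph`, the example policies, and the numeric context are hypothetical and not taken from the paper.

```python
from itertools import combinations


def policy_feedback_graph(policies, context):
    """Build the feedback graph over policies for a single round.

    Two policies are adjacent when they recommend the same action for
    the given context, so following one reveals the loss of the other.
    Returns a dict mapping each policy index to its set of neighbors.
    """
    recommended = [policy(context) for policy in policies]
    neighbors = {i: set() for i in range(len(policies))}
    for i, j in combinations(range(len(policies)), 2):
        if recommended[i] == recommended[j]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


# Hypothetical policies mapping a numeric context to one of two actions.
policies = [
    lambda x: "treat" if x > 0.5 else "wait",
    lambda x: "treat" if x > 0.2 else "wait",
    lambda x: "wait",
]
graph = policy_feedback_graph(policies, context=0.3)
```

With many policies but few actions, this graph is a disjoint union of cliques (one per recommended action), so its independence number is at most the number of actions.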
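In the adversarial online learning framework, we assume only that losses are in the range $[0,1]$. The goal of the learner is to minimize the so-called regret against an appropriate benchmark. The traditional notion of regret compares the performance of the algorithm to the best fixed action in hindsight. Writing $I(t)$ for the arm played at round $t$ and $\ell_i^t$ for the loss of arm $i$ at round $t$, for an arm $f$ we define regret as:
$$\mathrm{Reg}(f) = \sum_{t=1}^{T} \ell^t_{I(t)} - \sum_{t=1}^{T} \ell^t_f,$$
where $T$ is the time horizon. To evaluate performance, we consider regret against the best arm:
$$\mathrm{Reg} = \max_{f} \mathrm{Reg}(f).$$
Note that the regrets $\mathrm{Reg}(f)$ and $\mathrm{Reg}$ are random variables.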
A slightly weaker notion of regret is the notion of pseudoregret (cf. [BCB12]), which compares the expected performance of the algorithm to the expected loss of any fixed arm $f$, fixed in advance and not in hindsight. More formally, writing $I(t)$ for the arm played at round $t$ and $\ell_i^t$ for the loss of arm $i$ at round $t$, this notion of expected regret is:
$$\mathrm{PseudoReg} = \max_{f} \mathbb{E}\Big[\sum_{t=1}^{T} \ell^t_{I(t)} - \sum_{t=1}^{T} \ell^t_f\Big].$$
This is weaker than the expected regret, in which the maximum over arms $f$ is taken inside the expectation. (Footnote 1: To see the difference, consider arms that are similar but have high variance. Pseudoregret compares the algorithm’s performance against the expected performance of arms, while regret compares against the “best” arm depending on the outcomes of the randomness. This difference can be quite substantial: when throwing $n$ balls into $n$ bins, the expected load of any bin is $1$, while the expected maximum load is $\Theta(\log n / \log\log n)$.)
We aim for an even stronger notion of regret, guaranteeing low regret with high probability, i.e. with probability $1-\delta$ for all $\delta > 0$ simultaneously, instead of only in expectation, at the expense of a logarithmic dependence on $1/\delta$ in the regret bound for any fixed $\delta$. Note that any high-probability guarantee concerning the regret against any fixed arm $f$ with failure probability $\delta'$ automatically provides an overall regret guarantee with failure probability $\delta = d\,\delta'$, by a union bound over the $d$ arms. A high-probability guarantee on low regret also implies low regret in expectation. (Footnote 2: If the algorithm guarantees regret at most $B\log(1/\delta)$ with probability at least $1-\delta$ for any $\delta > 0$, then we can obtain the expected regret bound of $O(B)$ by upper bounding it by the integral $\int_0^\infty \Pr[\text{regret} > x]\,dx \le \int_0^\infty e^{-x/B}\,dx = B$.)
Small-loss regret bound.
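The goal of this paper is to develop algorithms with small-loss regret bounds, where the regret remains small when the best arm has small loss, i.e. where the regret depends on the loss of the comparator, and not on the time horizon. To achieve this, we focus on the notion of approximate regret (cf. [FLL16]), which is a multiplicative relaxation of the regret notion. Writing $I(t)$ for the arm played at round $t$ and $\ell_i^t$ for the loss of arm $i$ at round $t$, we define $\epsilon$-approximate regret for a parameter $\epsilon > 0$ as
$$\mathrm{ApxReg}(f, \epsilon) = (1-\epsilon)\sum_{t=1}^{T} \ell^t_{I(t)} - \sum_{t=1}^{T} \ell^t_f.$$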
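To make the regret notions concrete, here is a minimal sketch that evaluates regret and approximate regret against the best arm on a small loss table. It assumes the multiplicative relaxation takes the form $(1-\epsilon)$ times the learner's loss minus the comparator's loss; the function names and the example data are illustrative only.

```python
def cumulative_losses(losses, played):
    """losses[t][i] is the loss of arm i at round t; played[t] is the arm chosen."""
    learner = sum(losses[t][arm] for t, arm in enumerate(played))
    per_arm = [sum(row[i] for row in losses) for i in range(len(losses[0]))]
    return learner, per_arm


def regret(losses, played):
    """Regret against the best fixed arm in hindsight."""
    learner, per_arm = cumulative_losses(losses, played)
    return learner - min(per_arm)


def approximate_regret(losses, played, eps):
    """Approximate regret: the learner's loss is discounted by (1 - eps)."""
    learner, per_arm = cumulative_losses(losses, played)
    return (1 - eps) * learner - min(per_arm)


# Two arms, four rounds; the learner plays arm 1 once and then arm 0.
losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
played = [1, 0, 0, 0]
reg = regret(losses, played)                      # learner loss 2, best arm loss 1
apx = approximate_regret(losses, played, eps=0.5)
```

In this toy run the learner's loss is 2 and the best arm's loss is 1, so the regret is 1 while the 0.5-approximate regret is 0.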
We will prove bounds on the $\epsilon$-approximate regret in high probability and in expectation, and will use these to provide small-loss regret bounds by tuning $\epsilon$ appropriately, an approach that is often used in the literature in achieving classical regret guarantees and is referred to as the doubling trick. Typically, approximate regret bounds depend inversely on the parameter $\epsilon$. For instance, in classical full information algorithms, the expected approximate regret is bounded by $O(\log(d)/\epsilon)$ and therefore, setting $\epsilon = \sqrt{\log(d)/T}$, one obtains the classical uniform bounds. If we knew $L^\star$, the loss of the best arm at the end of round $T$, we could set $\epsilon = \sqrt{\log(d)/L^\star}$ and get the desired guarantee. Of course, $L^\star$ is not known in advance and, depending on the model of feedback, may not even be observed. To overcome these difficulties, we can make the choice of $\epsilon$ depend on the loss of the algorithm instead, and apply the doubling trick: start with a relatively large $\epsilon$, hoping for a small loss, and halve $\epsilon$ when we observe higher losses.
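The step from an approximate regret bound to a small-loss bound can be written out as a short calculation. The sketch below assumes a bound of the generic form $A/\epsilon$, where $A$ is a placeholder for the problem-dependent term (for example a logarithmic factor in the number of arms) and is not a quantity named in the text; $L_{\mathrm{alg}}$ denotes the learner's cumulative loss and $L^\star$ the comparator's.

```latex
% Assumption: the approximate regret is at most A / \epsilon for a placeholder term A.
\begin{align*}
(1-\epsilon)\, L_{\mathrm{alg}} - L^\star \le \frac{A}{\epsilon}
\;&\Longrightarrow\;
L_{\mathrm{alg}} - L^\star \le \frac{\epsilon L^\star + A/\epsilon}{1-\epsilon}, \\
\epsilon = \sqrt{A / L^\star} \le \tfrac{1}{2}
\;&\Longrightarrow\;
L_{\mathrm{alg}} - L^\star \le \frac{2\sqrt{A\, L^\star}}{1-\epsilon} \le 4\sqrt{A\, L^\star}.
\end{align*}
```

The choice of $\epsilon$ balances the two terms $\epsilon L^\star$ and $A/\epsilon$, which is why it requires an estimate of the loss and why a doubling scheme is needed when that loss is unknown.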
2.3 Other applications
Semi-bandits. We also extend our results to a different form of partial information: semi-bandits. In the semi-bandit problem we have a set of elements $\mathcal{E}$, such as edges in a network, and the learner needs to select from a set of possible actions $\mathcal{F}$, where each possible action $f \in \mathcal{F}$ corresponds to a subset of the elements $\mathcal{E}$. An example is selecting a path in a graph, where at round $t$, each element $e$ has a delay $\ell_e^t$, and the learner needs to select a path $f$ (connecting her source to her destination), and suffers the sum of the losses $\sum_{e \in f} \ell_e^t$. We use $\ell_f^t = \sum_{e \in f} \ell_e^t$ as the loss of the strategy $f$ at time $t$. We assume that the learner observes the loss on all edges in her selected strategy, but does not observe other losses. We measure regret compared to the best single strategy in hindsight, so we use $\mathcal{F}$ as the set of (possibly exponentially many) comparators.
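A single semi-bandit round for the path example can be sketched as follows. The function name, the tiny network, and the delay values are hypothetical and serve only to show which losses the learner suffers and which she observes.

```python
def semi_bandit_round(element_losses, chosen):
    """Play one semi-bandit round.

    element_losses maps each element (e.g. an edge) to its loss this round.
    chosen is the learner's action, a collection of elements (e.g. a path).
    Returns the total loss suffered and the per-element feedback, which
    covers only the elements of the chosen action.
    """
    observed = {e: element_losses[e] for e in chosen}
    return sum(observed.values()), observed


# Hypothetical network with two source-to-destination paths.
edge_delay = {("s", "a"): 0.2, ("a", "t"): 0.5, ("s", "b"): 0.1, ("b", "t"): 0.9}
path = [("s", "a"), ("a", "t")]
loss, feedback = semi_bandit_round(edge_delay, path)
```

The delays on the edges of the other path stay hidden, even though the comparator set contains that path as well.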
Contextual bandits. Another important class of applications is the contextual bandit problem, where the learner has a set of actions to choose from, but each step also has a context: at each time step $t$, she is presented with a context $x^t$, and can base her choice of action on the context. She also has a set of policies, where each policy is a function from contexts to actions. As an example, actions can be a set of medical treatment options, and contexts are the symptoms of the patient. A possible policy class can be finite and given explicitly, or large and only implicitly given, or can even be an infinite class of possible policies.
Regret with shifting comparators. In studying learning in changing environments [HW98], such as games with dynamic populations [LST16], it is useful to have regret guarantees against not only a single best arm, but also against a sequence of comparators, as changes in the environment may change the best arm over time. We overload $f$ to denote the vector of the comparators $f = (f(1), \dots, f(T))$ in such settings. If the comparator changes too often, no learning algorithm can do well against this standard. We will consider sequences where $f$ has only a limited number of changes, that is $f(t) = f(t+1)$ for all but $K$ rounds (with $K$ not known to the algorithm). To compare the performance to a sequence of different comparators, we extend our regret notions to this case, writing $I(t)$ for the arm played at round $t$, by:
$$\mathrm{Reg}(f) = \sum_{t=1}^{T} \ell^t_{I(t)} - \sum_{t=1}^{T} \ell^t_{f(t)}, \qquad \mathrm{ApxReg}(f, \epsilon) = (1-\epsilon)\sum_{t=1}^{T} \ell^t_{I(t)} - \sum_{t=1}^{T} \ell^t_{f(t)},$$
where $\epsilon$ corresponds to the multiplicative factor that comes in the regret relaxation. Typically the approximate regret guarantee depends linearly on the number of changes $K$ in the comparator sequence.
3 The black-box reduction for graph-based feedback
In this section, we present our black-box framework turning any full-information small-loss learning algorithm into an algorithm with a high-probability small-loss guarantee in the partial information feedback setting.
Our approach is based on an improved version of the classical importance sampling. The idea of importance sampling is to create for each arm an estimator for the loss of the arm and run the full information algorithm on the estimated losses. In classical importance sampling, the estimated loss of an arm is equal to its actual loss divided by the probability of it being observed. This makes the estimator unbiased as the expected estimated loss of any arm is equal to its actual loss. This general framework of importance sampling is also used with feedback graphs in [ACBG17]. In the feedback graph observation model, we acquire information for all arms observed and not only for the ones played; we therefore create an unbiased estimator via dividing the observed losses of an arm by the probability of it being observed. However, there is an important issue all these algorithms need to deal with: the estimated losses can become arbitrarily large as the probability of observing an arm can be arbitrarily low. This poses a major roadblock in the black-box application of a classical full information learning algorithm. To deal with this, typical partial information algorithms, such as EXP3 [ACBFS03] or EXP3-DOM [ACBG17], mix the resulting distribution with a small amount of uniform noise across arms, guaranteeing a lower bound on the probability of being observed and therefore an upper bound on the range of estimated losses. Since the added noise makes the algorithm play badly performing arms, this approach results in uniform regret bounds and not small-loss guarantees.
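The importance sampling estimator under graph feedback can be sketched as follows. This is a minimal illustration, assuming that `neighbors[i]` lists the arms adjacent to `i` (excluding `i` itself); all function names and the example values are hypothetical. The helper `expected_estimates` averages the estimator over the learner's random draw, which lets one check unbiasedness exactly on a small instance.

```python
def observation_probability(w, neighbors, i):
    """Probability that arm i is observed: i or one of its neighbors is played."""
    return w[i] + sum(w[j] for j in neighbors[i])


def importance_weighted_estimates(w, neighbors, losses, played):
    """Estimated losses after playing `played` under distribution w.

    Observed arms get their loss divided by their observation probability;
    unobserved arms get an estimate of 0.
    """
    observed = neighbors[played] | {played}
    return [
        losses[i] / observation_probability(w, neighbors, i) if i in observed else 0.0
        for i in range(len(w))
    ]


def expected_estimates(w, neighbors, losses):
    """Exact expectation of the estimates over the learner's random draw."""
    d = len(w)
    totals = [0.0] * d
    for played in range(d):
        if w[played] == 0:
            continue  # arms with zero probability are never played
        estimates = importance_weighted_estimates(w, neighbors, losses, played)
        for i in range(d):
            totals[i] += w[played] * estimates[i]
    return totals


# Path graph 0 - 1 - 2: playing arm 1 reveals all three losses.
path = {0: {1}, 1: {0, 2}, 2: {1}}
w = [0.5, 0.3, 0.2]
losses = [0.4, 1.0, 0.6]
means = expected_estimates(w, path, losses)

# Bandit feedback (empty graph): a rarely played arm gets a huge estimate.
bandit = {0: set(), 1: set()}
spike = importance_weighted_estimates([0.001, 0.999], bandit, [1.0, 0.0], played=0)
```

On the path graph the expectations match the true losses, while in the bandit example a single observation of the rarely played arm produces an estimate near 1000 for a loss of 1, which is the range problem discussed above.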
We use an alternate technique, first proposed by Allenberg et al. [AAGO06] in the context of the Multiplicative Weights algorithm for the bandit feedback setting. We set a threshold $\gamma$ and in each round neither play nor update the loss of arms with probability below this threshold. We refer to such arms as (temporarily) frozen. We note that frozen arms may get unfrozen in later rounds, if other arms incur losses, as we update frozen arms assuming their loss is 0. The resulting estimator for the loss of an arm is no longer unbiased since the estimated loss of frozen arms is $0$. However, crucially, the estimator is unbiased for the arms that we play and negatively biased for all arms, which allows us to extend the regret bound of the full information algorithm. When freezing some arms, we need to normalize the probabilities of the other arms so that they form a probability distribution. In order to obtain $\epsilon$-approximate regret guarantees, the total probability of all frozen arms should be at most $\epsilon$. Allenberg et al. [AAGO06] guarantee this for the bandit feedback setting by selecting $\gamma = \epsilon/d$, where $d$ is the number of arms, resulting in a dependence on the number of arms in the approximate regret bound.
In this section we extend this technique in three different ways.
We obtain small-loss learning algorithms for the case of feedback graphs, where the regret bound depends on the size of the maximum independent set $\alpha$, instead of $d$ (the number of nodes in $G$).
We achieve the above via a black-box reduction using any full information algorithm, not only via using the Multiplicative Weights algorithm.
We provide a small-loss guarantee that holds with high probability and not only in expectation.
Seeking bounds that are only a function of the size $\alpha$ of the maximum independent set, and have no dependence on the number of arms, we introduce a novel dual-threshold freezing technique. At each round $t$, we first freeze arms that are observed with probability less than some threshold $\gamma$. We show (Claim 3) that the total probability frozen at this initial step is at most $\alpha\gamma$. Unfortunately, freezing an arm in turn decreases the probability that its neighbors are observed. This effect can propagate and cause additional arms to be observed with probability less than $\gamma$, violating the upper bound on the estimated loss. To bound the total probability frozen during the propagation steps as a function of $\alpha$ while still maintaining a lower bound on the probability of observation for the played arms, we recursively freeze arms whose observation probability is smaller than $\gamma/3$. We show in Claim 3 that the total probability frozen during the recursive process is at most 3 times the total probability frozen in the initial step.
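The freezing step of a single round can be sketched as follows. This is an illustrative reading of the description above, not a reproduction of Algorithm 1: it assumes an upper threshold `gamma`, a lower threshold `gamma / 3`, and an adjacency map `neighbors` that excludes the arm itself; the function name and the example are hypothetical.

```python
def dual_threshold_freeze(p, neighbors, gamma):
    """One round of dual-threshold freezing.

    p[i] is the probability the full information algorithm assigns to arm i.
    Returns the renormalized play distribution w, the set of arms frozen in
    the initial step, and the set of all frozen arms.
    """
    d = len(p)

    def observed_mass(i, frozen):
        # Mass on arm i and its neighbors, counting only arms not yet frozen.
        return sum(p[j] for j in neighbors[i] | {i} if j not in frozen)

    # Initial step: freeze arms observed with probability below gamma.
    initial = {i for i in range(d) if observed_mass(i, set()) < gamma}
    frozen = set(initial)

    # Propagation: freeze remaining arms whose observation probability,
    # counting only unfrozen arms, has dropped below gamma / 3.
    while True:
        newly = {
            i for i in range(d)
            if i not in frozen and observed_mass(i, frozen) < gamma / 3
        }
        if not newly:
            break
        frozen |= newly

    remaining = 1.0 - sum(p[i] for i in frozen)
    if remaining <= 0:
        raise ValueError("all probability mass was frozen; gamma is too large")
    # Normalization: frozen arms are not played, the rest are scaled up.
    w = [0.0 if i in frozen else p[i] / remaining for i in range(d)]
    return w, initial, frozen


# Path 0 - 1 - 2 plus an isolated arm 3 that holds most of the mass.
graph = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
w, initial, frozen = dual_threshold_freeze([0.05, 0.03, 0.05, 0.87], graph, gamma=0.12)
```

In this example arms 0 and 2 are frozen in the initial step, which drops the observation probability of arm 1 from 0.13 to 0.03, below `gamma / 3`, so arm 1 is frozen during propagation; the mass frozen by propagation (0.03) stays within three times the initially frozen mass (0.10).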
We proceed by providing the algorithm (Algorithm 1), the crucial lemma that enables improved bounds beyond bandit feedback (Lemma 3), and the black-box guarantee. For clarity of presentation we first provide the approximate regret guarantee in expectation (Theorem 3) and then show its high-probability version (Theorem 3), in both cases assuming that the algorithm has access to an upper bound of the maximum independence number as an input parameter. In Theorem 3 we provide the small-loss version of the above bound without explicit knowledge of this quantity.
At every round , the total probability of frozen arms is at most : , and hence any non-frozen arm increases its probability due to freezing by a factor of at most .
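We first consider the arms that are frozen due to the $\gamma$-threshold (line 3 of the algorithm). Claim 3 shows that the total probability frozen in the initial step is bounded by $\alpha\gamma$. We then focus on the arms frozen due to the recursive $\gamma/3$-threshold (line 4 of the algorithm). Claim 3 bounds the total probability frozen in the propagation process by three times the total probability frozen in the initial step. Combining the two Claims, and writing $F^t$ for the set of arms frozen at round $t$ and $p_i^t$ for the probability of arm $i$ before freezing, we obtain:
$$\sum_{i \in F^t} p_i^t \le \alpha\gamma + 3\alpha\gamma = 4\alpha\gamma.$$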
The lemma then follows from the relation in the normalization step of the algorithm (line 5). ∎
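Next we show the two main claims needed in the previous proof. The total probability frozen in the initial set $F_0^t$ is bounded by $\alpha\gamma$.
Let $S$ be a maximal independent set on $F_0^t$. Since the independent set is maximal, every node in $F_0^t$ either is in $S$ or has a neighbor in $S$, so, writing $p_i^t$ for the probability of arm $i$ before freezing and $N_j^t$ for the set containing $j$ and its neighbors, we obtain:
$$\sum_{i \in F_0^t} p_i^t \le \sum_{j \in S} \sum_{i \in N_j^t} p_i^t \le \alpha\gamma,$$
where the last inequality follows from the fact that there are at most $\alpha$ nodes in $S$ and, since they are frozen, the probability of being observed is at most $\gamma$ for each of them. ∎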
The total probability frozen in the propagation steps is bounded by three times the total probability frozen at the initial step. More formally:
The purpose of the lower threshold in line 4 is to limit the propagation of frozen probability. Consider an arm frozen on step . Since arm was not frozen at step , the initial probability of being observed by any node of is at least . When this arm becomes frozen, it is observed with probability at most . Hence of the original probability stems from arms frozen earlier. Using this, we can bound the probability mass in by at most 1.5 times the mass of . Further, from these arms at most of the originally at least probability is newly frozen, and hence can affect non yet frozen arms, creating a further cascade. We show that the total frozen probability can be at most times the probability of nodes in . The proof of this fact follows in a way that is analogous of how the number of internal nodes of a binary tree is bounded by the number of leaves, as any node can have at most 1 parent, while having 2 children.
More formally, we consider an auxiliary function that serves as an upper bound of the left-hand side and a lower bound of the right-hand side, proving the claim. The claim is focused on a single round . For simplicity of notation, we drop the dependence on from the notation, i.e., we use for the set of nodes frozen, for the probability of node , for the graph, and for its edge-set. Let . We order all nodes in based on when they are frozen. More formally, if and with then . This is a partial ordering as does not order nodes frozen at the same iteration of the recursive freezing. We now introduce the heart of the auxiliary function, which lies in the sum of the products of probabilities along edges with , such that , i.e.
To lower bound this quantity, we sum over first. Node was not in so its neighborhood has a total probability mass of at least . By the time is frozen, the remaining probability mass is less than , so a total probability mass of at least must come from earlier frozen neighbors.
To upper bound the above quantity, we sum over first, and separate the sum for and . Nodes have a total probability of less than in their neighborhood, as they are frozen in line 3 of the algorithm. Nodes have at most probability mass left in their neighborhood when they become frozen and therefore at most this much total probability on neighbors later in the ordering.
The above lower and upper bounds imply that and hence we obtain the claimed bound (reintroducing the round in the notation):
We are now ready to prove our first result: a bound for learning with partial information based on feedback graphs. We first provide the guarantee for approximate pseudoregret in expectation. We assume that both the learning rate and an upper bound on the size of the independent sets are given as input. At the end of this section, we show how the results can be turned into regret guarantees via the doubling trick, without knowledge of the independence number.
Let be any full information algorithm with an expected approximate regret guarantee given by: against any arm , when run on losses in . The Dual-Threshold Freezing Algorithm (Algorithm 1) run with learning parameter on input , , , has expected -approximate regret guarantee: .
First notice that our random estimated loss is negatively biased, so for all arms and all rounds , where expectation is taken over the choice of arm . Bounding the loss of the algorithm against the expected estimated loss of arm implies the bound we seek.
Next consider the losses incurred by the algorithm compared to the estimated losses the full information algorithm observes. Note that the estimator is unbiased for the arms that the algorithm plays (as those are not frozen), so the expected loss of the full information algorithm when run on the estimated losses is equal to its expected loss when run on the actual losses: for all . Last, freezing guarantees that the maximum estimated loss is (since the probability of being observed is at least for any non-frozen arm else it would freeze at step 4 of the algorithm). Combining these and using that we obtain the following:
|as on all arms played.|
|by Lemma 3.|
|by the low approx regret of .|
|as the estimator is negatively biased|
|using definitions of , , and .|
Notice that it was important to be able to use a freezing threshold instead of for the above analysis, allowing an approximate regret bound with no dependence on .
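The bias properties of the estimator used in this proof can be checked numerically. The sketch below assumes an importance-sampling estimator that divides an observed loss by the arm's observation probability under the played distribution and reports zero for frozen arms; this form, the function names, and the example data are assumptions made for illustration.

```python
from typing import Dict, Hashable, Set

Arm = Hashable


def estimated_losses(losses: Dict[Arm, float], w: Dict[Arm, float],
                     neighbors: Dict[Arm, Set[Arm]], frozen: Set[Arm],
                     played: Arm) -> Dict[Arm, float]:
    """Importance-weighted loss estimates after `played` is selected.

    Assumption: neighbors[i] is the set of arms whose play reveals arm i.
    """
    est = {}
    for i, loss in losses.items():
        if i in frozen or played not in neighbors[i]:
            est[i] = 0.0  # frozen or unobserved arms get estimate zero
        else:
            obs_prob = sum(w[j] for j in neighbors[i])
            est[i] = loss / obs_prob
    return est


def expected_estimates(losses: Dict[Arm, float], w: Dict[Arm, float],
                       neighbors: Dict[Arm, Set[Arm]], frozen: Set[Arm]
                       ) -> Dict[Arm, float]:
    """Exact expectation of the estimates over the random played arm."""
    expected = {i: 0.0 for i in losses}
    for played, prob in w.items():
        if prob == 0.0:
            continue
        est = estimated_losses(losses, w, neighbors, frozen, played)
        for i in losses:
            expected[i] += prob * est[i]
    return expected


# Hypothetical example: arm 0 is frozen, arms 1-3 are played.
neighbors = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}, 3: {3}}
w = {0: 0.0, 1: 0.2, 2: 0.3, 3: 0.5}
losses = {0: 0.5, 1: 0.2, 2: 0.8, 3: 0.1}
expected = expected_estimates(losses, w, neighbors, frozen={0})
```

For the non-frozen arms the expectation matches the true loss, while the frozen arm has expected estimate zero, which is never larger than its true loss.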
High probability bound.
To obtain a high-probability guarantee (and hence a bound on the actual regret, not pseudoregret), we encounter an additional complication since we need to upper bound the cumulative estimated loss of the comparator by its cumulative actual loss. For this purpose, the mere fact that the estimator is negatively biased does not suffice. The estimator may, in principle, be unbiased (if the arm is never frozen), and the variance it suffers can be high, which could ruin the small-loss guarantee. To deal with this, we apply a concentration inequality, comparing the expected loss to a multiplicative approximation of the actual loss. This comparison is inspired by the approximate regret notion: the resulting quantity has negative mean, and its variance depends on as well as on the magnitude of the estimated losses, which is .
Let be any full information algorithm with an expected approximate regret guarantee of: , against any arm , when run on losses in . The Dual-Threshold Freezing Algorithm (Algorithm 1) run with learning parameter on input , , , has -approximate regret: with probability for any . To prove the theorem, we need the following concentration inequality, showing that the sum of a sequence of (possibly dependent) random variables cannot be much higher than the sum of their expectations: Let be a sequence of non-negative random variables, s.t. . Let . Then, for any , with probability at least
and also with probability at least
The proof follows the outline of classical Chernoff bounds for independent variables combined with the law of total expectation to handle the dependence. For completeness, the proof details are provided in Appendix A.
Proof of Theorem 3.
To obtain a high-probability statement, we use Lemma 3 multiple times as follows:
1. Show that the sum of the algorithm’s losses stays close to the sum of the expected losses.
2. Show that the sum of the expected losses stays close to the sum of the expected estimated losses used by the full information algorithm.
3. Show that the sum of the estimated losses of each arm stays close to the sum of the actual losses.
Starting with item 1, we use , and note that its expectation conditioned on the previous losses is so we obtain that, for any , with probability at least
Next, for item 3: for a comparator we use the lemma with and its expectation . Now is bounded by and not 1, so by scaling we obtain that with probability
Finally, we use the lower bound in the lemma to show item 2: for , the expected losses observed by the full information algorithm, and its expectation . Again, the so we obtain that with probability ,
Using the union bound and , all these inequalities hold simultaneously for all . To simplify notation, we use for the error bounds above.
The small-loss guarantee without knowing .
We presented the results so far in terms of approximate regret and assuming we have , an upper bound for the maximum independent set, as an input. Next we show that we can use this algorithm with the classical doubling trick without knowing , and achieve low regret (not only approximate regret) both in expectation and with high probability. We start with a large and small and halve and double them respectively, when observing that they are not set right. There are two issues worth mentioning.
First, unlike full information, partial information does not provide access to the loss of the comparator . As a result, we apply the doubling trick on the loss of the algorithm instead and then bound the regret of the algorithm appropriately. This is formalized in the following lemma, which follows standard doubling arguments and whose proof is provided in Appendix A for completeness. [standard doubling trick] Suppose we have a randomized algorithm that takes as input any and guarantees that, for some and some function , and any , with probability , for any time horizon and any comparator :
Assume that we use this algorithm over multiple phases (restarting the algorithm when a phase ends), and that we run each phase with until where denotes the cumulative loss of the algorithm for phase . Then, for any , the regret for this multi-phase algorithm is bounded, with probability at least as:
Second, observing the maximum independent set is challenging since this task is NP-hard to approximate. However, if one looks carefully into our proofs, we just require knowledge of a maximal independent set on the -frozen arms and not one of maximum size. This can be easily computed greedily at each round and therefore our algorithm can handle changing graphs without requiring knowledge of the maximum independence number. Combining these two observations, we prove the following small-loss guarantee.
Let be any full information algorithm with -approximate regret bounded by when run on losses in and with parameter . If one runs the Dual-Threshold Freezing Algorithm (Algorithm 1) as in Theorem 3 and using the doubling scheme as in Lemma 3 and tuning appropriately on each phase, then for any , with probability at least the regret of this algorithm is bounded by .
First, for simplicity, assume that is known in advance. In this case, using Theorem 3, we can conclude that for any , Algorithm 1 run with enjoys an -approximate regret guarantee of . Hence, running Algorithm 1 while tuning the -parameter using the doubling trick as in Lemma 3 with and yields the regret guarantee of
If is not known in advance, we can begin with a guess (say ) and double the guess every time it turns out to be incorrect, i.e., when the maximal independent set of the -frozen nodes has more than nodes. We make at most updates. Within one phase with the same update, the previous guarantee holds with probability at least some . At the time of an update we can incur an extra loss of at most . For the rest of the rounds, the guarantees work additively. Therefore, setting , we obtain the previous guarantee with an extra decay in the guarantee. Since , the dependence on is dropped in the notation of the regret bound. ∎
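The phase-restart scheme of the doubling lemma above, driven by the algorithm's own cumulative loss rather than the comparator's, can be sketched as follows. This is only an illustration: the learner interface, the name `phase_budget`, and the stopping rule used in the example are hypothetical stand-ins for the condition stated in the lemma.

```python
from typing import Callable, Iterable, List, Tuple


def run_with_doubling(make_learner: Callable[[float], Callable[[object], float]],
                      rounds: Iterable[object],
                      phase_budget: Callable[[float], float],
                      eps0: float = 0.5) -> Tuple[float, List[float]]:
    """Run a learner in phases, halving its parameter between phases.

    make_learner(eps) returns a fresh learner; calling it on a round's input
    returns the loss it incurred. A phase ends once the loss accumulated in
    that phase exceeds phase_budget(eps) (a hypothetical stopping rule).
    """
    eps = eps0
    learner = make_learner(eps)
    eps_history = [eps]
    phase_loss = 0.0
    total_loss = 0.0
    for round_input in rounds:
        loss = learner(round_input)
        phase_loss += loss
        total_loss += loss
        if phase_loss > phase_budget(eps):
            # The parameter was too large for the observed loss: halve it
            # and restart the learner from scratch.
            eps /= 2.0
            learner = make_learner(eps)
            eps_history.append(eps)
            phase_loss = 0.0
    return total_loss, eps_history


# Hypothetical example: a dummy learner that always incurs loss 1.
total, history = run_with_doubling(
    make_learner=lambda eps: (lambda _round: 1.0),
    rounds=range(10),
    phase_budget=lambda eps: 1.0 / eps,
)
```

Only the learner's own losses are inspected when deciding to end a phase, which is what makes the scheme usable under partial feedback.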
4 Other applications of the black-box framework
The framework of the previous section can capture via a small modification other partial information feedback settings. We discuss here semi-bandits and contextual bandits (including applications with infinite comparator classes), as well as learning tasks against shifting comparators. In these settings, our framework converts data-dependent guarantees of full-information algorithms (which are well understood) to similar high-probability bounds under partial information.
To model semi-bandits as a variant of our feedback graph framework, we construct a bipartite graph with nodes and , and connect strategies to the elements included in . We note that this graph does not need to be explicitly maintained by the algorithm, as we discuss at the end of the subsection and elaborate upon in Appendix B.1.
Similarly to the previous section, we provide a reduction from full information to partial information for this setting. The full-information algorithm runs on estimated losses created by importance sampling as before and induces, at round , a probability distribution on the set of strategies as only strategies can be selected and not individual elements. We assume a bound on the expected approximate regret of for the full information algorithm when losses are in . This will scale linearly with the magnitude of losses of an action . We apply importance sampling and freezing to the elements . The probability of observing an element is the sum of the probabilities of the adjacent strategies. For clarity of presentation, we first assume that we have access to these probabilities and then show how this can be obtained via sampling. This demonstrates the leverage freezing offers in bounding the number of samples required. We modify the freezing process to freeze elements when observed with probability less than , and then freeze all strategies that contain some frozen element.
The reduction is similar to the one of Algorithm 1: In the corresponding initial step and recursive process, we only freeze nodes that are in the set if their observation probability is below the threshold and apply a single threshold for all the recursive steps (instead of multi-thresholding). We subsequently freeze any node in that is adjacent to a frozen node in , and repeat the recursive process until no unfrozen element has probability of observation smaller than . After the freezing process, the final probability distribution is derived again via a renormalization on the non-frozen strategies as in step 5 of Algorithm 1. The resulting algorithm is provided in Algorithm 2.
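The single-threshold freezing on the bipartite graph described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: it assumes exact access to the observation probabilities, and the function name, the dictionary representation of strategies, and the example values are hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple

Strategy = Hashable
Element = Hashable


def semi_bandit_freeze(p: Dict[Strategy, float],
                       strategies: Dict[Strategy, Set[Element]],
                       gamma: float
                       ) -> Tuple[Set[Element], Set[Strategy], Dict[Strategy, float]]:
    """Freeze rarely observed elements and every strategy containing one."""
    elements: Set[Element] = set().union(*strategies.values())
    frozen_elems: Set[Element] = set()
    frozen_strats: Set[Strategy] = set()
    while True:
        # An element is observed whenever a non-frozen strategy containing
        # it is played, so its observation probability is a sum over those.
        newly = set()
        for e in elements - frozen_elems:
            obs = sum(p[s] for s, elems in strategies.items()
                      if e in elems and s not in frozen_strats)
            if obs < gamma:
                newly.add(e)
        if not newly:
            break
        frozen_elems |= newly
        # Freeze every strategy adjacent to a frozen element.
        frozen_strats |= {s for s, elems in strategies.items()
                          if elems & frozen_elems}
    remaining = sum(p[s] for s in p if s not in frozen_strats)
    if remaining <= 0:
        raise ValueError("all strategies were frozen; gamma is too large")
    w = {s: 0.0 if s in frozen_strats else p[s] / remaining for s in p}
    return frozen_elems, frozen_strats, w


# Hypothetical example: four strategies (e.g. paths) over elements a-d.
strategies = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"c", "d"}, "s4": {"d"}}
p = {"s1": 0.03, "s2": 0.04, "s3": 0.03, "s4": 0.90}
frozen_elems, frozen_strats, w = semi_bandit_freeze(p, strategies, gamma=0.05)
```

In this example only element `a` is below the threshold at first, but freezing `s1` pushes `b` below it, and freezing `s2` then does the same to `c`, which shows the cascade that the recursive process has to account for.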
Given that, we can now provide the equivalent lemma to Lemma 3 to bound the total frozen probability. At round , the total probability of frozen strategies is at most , i.e. .
When a node in becomes frozen in the initial step, it means that its probability of observation is less than . Since the probability of playing adjacent nodes in contributes to this probability of observation, at the initial step of the recursive process, the total probability frozen is less than times the number of nodes in that are frozen. As before, freezing some nodes in may cause other nodes to become frozen. By freezing , we also freeze all its neighbors , which can decrease the observation probability of other edges. In the propagation process, if an element becomes frozen, its total probability of observation by not-yet-frozen strategies is at most . Hence the total frozen probability is at most , which concludes the lemma. ∎
Using the Follow the Perturbed Leader algorithm [KV05] we obtain an approximate regret bound of . The magnitude of the estimated losses of a strategy is at most where corresponds to the maximum number of elements in a strategy, e.g., the maximum length of any path. Let be any full information algorithm for the problem whose expected approximate regret is bounded as when run on losses bounded by . Further assume that we have access to used in the algorithm. Then, the Semi-Bandit Freezing Algorithm (Algorithm 2) run with learning rate on input guarantees that for any with probability ,
The proof is similar to that of Theorem 3, adjusted to the semi-bandit setting. We denote by the probability of observing an element . We use the subscript for strategy nodes (paths) and the subscript for element nodes (edges). Recall that is the maximum number of edges in any path. More formally, for each comparator , we obtain the following set of inequalities with probability at least :
|Using Lemma 4.1|
|Using the full information guarantee and noting that losses are bounded by , this is bounded by|
|Now by applying concentration Lemma 3, for each and taking a union bound over ,|
Since , using and such that , we conclude the proof. ∎
Sampling the probabilities of observation.
In the previous part, we assumed that, at any point, we have access to the probability that an element is observed. This is used both to define which elements are frozen and to define the estimated loss