From Bandits to Experts: A Tale of Domination and Independence

07/17/2013 ∙ by Noga Alon, et al. ∙ Tel Aviv University, Università degli Studi di Milano

We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph. We also show that in the undirected case, the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner.







1 Introduction

Prediction with expert advice (see, e.g., [10, 13, 5, 8, 6]) is a general abstract framework for studying sequential prediction problems, formulated as repeated games between a player and an adversary. A well-studied example of a prediction game is the following: in each round, the adversary privately assigns a loss value to each action in a fixed set. Then the player chooses an action (possibly using randomization) and incurs the corresponding loss. The goal of the player is to control regret, which is defined as the excess loss incurred by the player as compared to the best fixed action over a sequence of rounds. Two important variants of this game have been studied in the past: the expert setting, where at the end of each round the player observes the loss assigned to each action for that round, and the bandit setting, where the player only observes the loss of the chosen action, but not that of other actions.

Let $K$ be the number of available actions, and $T$ be the number of prediction rounds. The best possible regret for the expert setting is of order $\sqrt{T \ln K}$. This optimal rate is achieved by the Hedge algorithm [8] or the Follow the Perturbed Leader algorithm [9]. In the bandit setting, the optimal regret is of order $\sqrt{TK}$, achieved by the INF algorithm [2]. A bandit variant of Hedge, called Exp3 [3], achieves a slightly worse bound of order $\sqrt{TK \ln K}$.

Recently, Mannor and Shamir [11] introduced an elegant way for defining intermediate observability models between the expert setting (full observability) and the bandit setting (single observability). An intuitive way of representing an observability model is through a directed graph over actions: an arc from action $i$ to action $j$ implies that when playing action $i$ we get information also about the loss of action $j$. Thus, the expert setting is obtained by choosing a complete graph over actions (playing any action reveals all losses), and the bandit setting is obtained by choosing an empty edge set (playing an action only reveals the loss of that action).

The main result of [11] concerns undirected observability graphs. The regret is characterized in terms of the independence number $\alpha$ of the undirected observability graph. Specifically, they prove that $\sqrt{\alpha T}$ is the optimal regret (up to logarithmic factors) and show that a variant of Exp3, called ELP, achieves this bound when the graph is known ahead of time. Here $\alpha$ interpolates between full observability ($\alpha = 1$ for the clique) and single observability ($\alpha = K$ for the graph with no edges). Given the observability graph, ELP runs a linear program to compute the desired distribution over actions. In the case when the graph changes over time, and at each time step ELP observes the current observability graph before prediction, a bound of

$\sqrt{\ln K \sum_{t=1}^{T} \alpha_t}$

is shown, where $\alpha_t$ is the independence number of the graph at time $t$. A major problem left open in [11] was the characterization of regret for directed observability graphs, a setting for which they only proved partial results.

Our main result is a full characterization (to within logarithmic factors) of regret in the case of directed and dynamic observability graphs. Our upper bounds are proven using a new algorithm, called Exp3-DOM. This algorithm is efficient to run even in the dynamic case: it just needs to compute a small dominating set of the current observability graph (which must be given as side information) before prediction. (Computing an approximately minimum dominating set can be done by running a standard Greedy Set Cover algorithm; see Section 2.) As in the undirected case, the regret for the directed case is characterized in terms of the independence numbers of the observability graphs (computed ignoring edge directions). We arrive at this result by showing that a key quantity emerging in the analysis of Exp3-DOM can be bounded in terms of the independence numbers of the graphs. This bound (Lemma 13 in the appendix) is based on a combinatorial construction which might be of independent interest.

We also explore the possibility of the learning algorithm receiving the observability graph only after prediction, and not before. For this setting, we introduce a new variant of Exp3, called Exp3-SET, which achieves the same regret as ELP for undirected graphs, but without the need of accessing the current observability graph before each prediction. We show that Exp3-SET also performs well in some random directed graph models. In general, we can upper bound the regret of Exp3-SET as a function of the size of the maximum acyclic subgraph of the observability graph, but this upper bound may not be tight. Yet, Exp3-SET is much simpler and computationally less demanding than ELP, which needs to solve a linear program in each round.

There are a variety of real-world settings where partial observability models corresponding to directed and undirected graphs are applicable. One of them is route selection. We are given a graph of possible routes connecting cities: when we select a route $r$ connecting two cities, we observe the cost (say, driving time or fuel consumption) of the edges along that route and, in addition, we have complete information on any sub-route $r'$ of $r$, but not vice versa. We abstract this in our model by having an observability graph over the set of routes, with an arc from each route $r$ to any of its sub-routes $r'$.

Sequential prediction problems with partial observability models also arise in the context of recommendation systems. For example, an online retailer that advertises products to users knows that users buying certain products are often interested in a set of related products. This knowledge can be represented as a graph over the set of products, where two products are joined by an edge if and only if users who buy any one of the two are likely to buy the other as well. In certain cases, however, edges have a preferred orientation. For instance, a person buying a video game console might also buy a high-def cable to connect it to the TV set. Conversely, interest in high-def cables need not indicate an interest in game consoles.

Such observability models may also arise in the case when a recommendation system operates in a network of users. For example, consider the problem of recommending a sequence of products, or contents, to users in a group. Suppose the recommendation system is hosted on an online social network, on which users can befriend each other. In this case, it has been observed that social relationships reveal similarities in tastes and interests [12]. However, social links can also be asymmetric (e.g., followers of celebrities). In such cases, followers might be more likely to shape their preferences after the person they follow than the other way around. Hence, a product liked by a celebrity is probably also liked by his/her followers, whereas a preference expressed by a follower is more often specific to that person.

2 Learning protocol, notation, and preliminaries

As stated in the introduction, we consider an adversarial multi-armed bandit setting with a finite action set $V = \{1, \dots, K\}$. At each time $t = 1, 2, \dots$, a player (the “learning algorithm”) picks some action $I_t \in V$ and incurs a bounded loss $\ell_{I_t,t} \in [0,1]$. Unlike the standard adversarial bandit problem [3, 6], where only the played action reveals its loss $\ell_{I_t,t}$, here we assume all the losses in a subset $S_{I_t,t} \subseteq V$ of actions are revealed after $I_t$ is played. More formally, the player observes the pairs $(i, \ell_{i,t})$ for each $i \in S_{I_t,t}$. We also assume $i \in S_{i,t}$ for any $i$ and $t$, that is, any action reveals its own loss when played. Note that the bandit setting ($S_{i,t} = \{i\}$) and the expert setting ($S_{i,t} = V$) are both special cases of this framework. We call $S_{i,t}$ the observation set of action $i$ at time $t$, and write $i \to j$ when at time $t$ playing action $i$ also reveals the loss of action $j$. Hence, $S_{i,t} = \{j \in V : i \to j\}$. The family of observation sets $\{S_{i,t}\}_{i \in V}$ we collectively call the observation system at time $t$.
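To make the protocol concrete, here is a minimal simulation sketch in Python (all names and the specific observation system are illustrative choices of ours, not from the paper): every action reveals its own loss, and, for the sake of the example, also the loss of the next action around a directed cycle.

```python
import random

K = 4  # number of actions

def observation_set(i, t):
    # Illustrative observation system: playing i reveals i's own loss
    # (as required by the protocol) plus the loss of the next action
    # on a directed cycle.
    return {i, (i + 1) % K}

def losses(t):
    # The adversary assigns a loss in [0, 1] to every action; here a
    # fixed deterministic rule stands in for an arbitrary assignment.
    return [((i + t) % K) / K for i in range(K)]

for t in range(5):
    ell = losses(t)
    I_t = random.randrange(K)  # the player's (randomized) choice
    # The player incurs ell[I_t] and observes the pairs (i, ell[i])
    # for every i in the observation set of the played action.
    feedback = {i: ell[i] for i in observation_set(I_t, t)}
```

With `observation_set(i, t) == {i}` this reduces to the bandit setting, and with `observation_set(i, t) == set(range(K))` to the expert setting.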

The adversaries we consider are nonoblivious. Namely, each loss $\ell_{i,t}$ at time $t$ can be an arbitrary function of the player’s past actions $I_1, \dots, I_{t-1}$. The performance of a player $A$ is measured through the regret

$\max_{k \in V} \mathbb{E}\big[L_{A,T} - L_{k,T}\big]$

where $L_{A,T} = \ell_{I_1,1} + \dots + \ell_{I_T,T}$ and $L_{k,T} = \ell_{k,1} + \dots + \ell_{k,T}$ are the cumulative losses of the player and of action $k$, respectively. The expectation is taken with respect to the player’s internal randomization (since losses are allowed to depend on the player’s past random actions, $L_{k,T}$ may also be random). (Although we defined the problem in terms of losses, our analysis can be applied to the case when actions return rewards $g_{i,t} \in [0,1]$ via the transformation $\ell_{i,t} = 1 - g_{i,t}$.) The observation system $\{S_{i,t}\}$ is either adversarially generated (in which case, each $S_{i,t}$ can be an arbitrary function of the player’s past actions, just like losses are), or randomly generated; see Section 3. In this respect, we distinguish between adversarial and random observation systems.

Moreover, whereas some algorithms need to know the observation system at the beginning of each step $t$, others do not. From this viewpoint, we shall consider two online learning settings. In the first setting, called the informed setting, the whole observation system $\{S_{i,t}\}_{i \in V}$ selected by the adversary is made available to the learner before making its choice $I_t$. This is essentially the “side-information” framework first considered in [11]. In the second setting, called the uninformed setting, no information whatsoever regarding the time-$t$ observation system is given to the learner prior to prediction.

We find it convenient to adopt the same graph-theoretic interpretation of observation systems as in [11]. At each time step $t$, the observation system $\{S_{i,t}\}_{i \in V}$ defines a directed graph $G_t = (V, D_t)$, where $V$ is the set of actions and $D_t$ is the set of arcs, i.e., ordered pairs of nodes. For $j \neq i$, arc $(i, j) \in D_t$ if and only if $j \in S_{i,t}$ (the self-loops created by $i \in S_{i,t}$ are intentionally ignored). Hence, we can equivalently define the observation system in terms of $G_t$. Observe that the outdegree of any $i \in V$ equals $|S_{i,t}| - 1$. Similarly, the indegree of $i$ is the number of actions $j \neq i$ such that $i \in S_{j,t}$ (i.e., such that $(j, i) \in D_t$). A notable special case of the above is when the observation system is symmetric over time: $j \in S_{i,t}$ if and only if $i \in S_{j,t}$ for all $i, j$ and $t$. In words, playing $i$ at time $t$ reveals the loss of $j$ if and only if playing $j$ at time $t$ reveals the loss of $i$. A symmetric observation system is equivalent to $G_t$ being an undirected graph or, more precisely, to a directed graph having, for every pair of nodes $i, j \in V$, either no arcs or length-two directed cycles. Thus, from the point of view of the symmetry of the observation system, we also distinguish between the directed case ($G_t$ is a general directed graph) and the symmetric case ($G_t$ is an undirected graph for all $t$). For instance, combining the terminology introduced so far, the adversarial, informed, and directed setting is when $G_t$ is an adversarially-generated directed graph disclosed to the algorithm in round $t$ before prediction, while the random, uninformed, and directed setting is when $G_t$ is a randomly generated directed graph which is not given to the algorithm before prediction.

The analysis of our algorithms depends on certain properties of the sequence of graphs $G_t$. Two graph-theoretic notions playing an important role here are those of independent sets and dominating sets. Given an undirected graph $G = (V, E)$, an independent set of $G$ is any subset $U \subseteq V$ such that no two $i, j \in U$ are connected by an edge in $E$. An independent set is maximal if no proper superset thereof is itself an independent set. The size of a largest (maximal) independent set is the independence number of $G$, denoted by $\alpha(G)$. If $G$ is directed, we can still associate with it an independence number: we simply view $G$ as undirected by ignoring arc orientation. If $G = (V, D)$ is a directed graph, then a subset $R \subseteq V$ is a dominating set for $G$ if for all $j \notin R$ there exists some $i \in R$ such that arc $(i, j) \in D$. In our bandit setting, a time-$t$ dominating set $R_t$ is a subset of actions with the property that the loss of any remaining action in round $t$ can be observed by playing some action in $R_t$. A dominating set is minimal if no proper subset thereof is itself a dominating set. The domination number of directed graph $G$, denoted by $\gamma(G)$, is the size of a smallest (minimal) dominating set of $G$.
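For small graphs both quantities can be computed by brute force, which may help intuition; the sketch below is ours (function names are not from the paper) and views the graph as undirected when computing the independence number, as described above.

```python
from itertools import combinations

def independence_number(K, arcs):
    """Size of a largest node set with no connecting arc (directions ignored)."""
    und = {frozenset(a) for a in arcs if a[0] != a[1]}
    for size in range(K, 0, -1):
        for S in combinations(range(K), size):
            if all(frozenset(pair) not in und for pair in combinations(S, 2)):
                return size
    return 0

def domination_number(K, arcs):
    """Size of a smallest R such that every node outside R has an arc from R."""
    out = {i: {j for (a, j) in arcs if a == i} for i in range(K)}
    for size in range(1, K + 1):
        for R in combinations(range(K), size):
            covered = set(R).union(*(out[i] for i in R))
            if len(covered) == K:
                return size
    return K

# A directed 3-cycle is a triangle once directions are ignored:
# independence_number(3, [(0, 1), (1, 2), (2, 0)]) -> 1
# and any single node only dominates itself and one neighbor:
# domination_number(3, [(0, 1), (1, 2), (2, 0)]) -> 2
```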

Computing a minimum dominating set for an arbitrary directed graph $G_t$ is equivalent to solving a minimum set cover problem on the associated observation system $\{S_{i,t}\}_{i \in V}$. Although minimum set cover is NP-hard, the well-known Greedy Set Cover algorithm [7], which repeatedly selects the set containing the largest number of uncovered elements so far, computes a dominating set $R_t$ such that $|R_t| \le (1 + \ln K)\,\gamma(G_t)$.
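The greedy step can be sketched as follows (a minimal illustration with our own naming; each action $i$ is identified with the set of nodes it observes, so dominating all of $V$ with few actions is exactly a set cover):

```python
def greedy_dominating_set(observation_sets):
    """observation_sets[i] = set of nodes whose loss is revealed by playing i
    (each set contains i itself).  Greedy Set Cover: repeatedly pick the
    action covering the most still-uncovered nodes; the classical guarantee
    gives |R| <= (1 + ln K) * (size of a minimum dominating set)."""
    K = len(observation_sets)
    uncovered = set(range(K))
    R = []
    while uncovered:
        # Pick the action whose observation set covers most uncovered nodes.
        i = max(range(K), key=lambda j: len(observation_sets[j] & uncovered))
        R.append(i)
        uncovered -= observation_sets[i]
    return R
```

For a “star” system where action 0 observes everything, the greedy pick is immediate: `greedy_dominating_set([{0, 1, 2}, {1}, {2}])` returns `[0]`.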

Finally, we can also lift the independence number of an undirected graph to directed graphs through the notion of maximum acyclic subgraphs: Given a directed graph $G = (V, D)$, an acyclic subgraph of $G$ is any graph $G' = (V', D')$ such that $V' \subseteq V$ and $D' = D \cap (V' \times V')$, with no (directed) cycles. We denote by $\mathrm{mas}(G)$ the maximum size $|V'|$ of such a subgraph. Note that when $G$ is undirected (more precisely, as above, when $G$ is a directed graph having, for every pair of nodes $i, j \in V$, either no arcs or length-two cycles), then $\mathrm{mas}(G) = \alpha(G)$; otherwise $\mathrm{mas}(G) \ge \alpha(G)$. In particular, when $G$ is itself a directed acyclic graph, then $\mathrm{mas}(G) = |V|$.

3 Algorithms without Explicit Exploration: The Uninformed Setting

In this section, we show that a simple variant of the Exp3 algorithm [3] obtains optimal regret (to within logarithmic factors) in two variants of the uninformed setting: (1) adversarial and symmetric, (2) random and directed. We then show that even the harder adversarial and directed setting lends itself to an analysis, though with a weaker regret bound.

Exp3-SET runs Exp3 without mixing with the uniform distribution. Similar to Exp3, Exp3-SET uses loss estimates $\hat{\ell}_{i,t}$ that divide each observed loss by the probability $q_{i,t}$ of observing it. This probability $q_{i,t}$ is simply the sum of all $p_{j,t}$ such that $i \in S_{j,t}$ (the sum includes $p_{i,t}$). Next, we bound the regret of Exp3-SET in terms of the key quantity

$Q_t = \sum_{i \in V} \frac{p_{i,t}}{q_{i,t}} = \sum_{i \in V} \frac{p_{i,t}}{\sum_{j \,:\, i \in S_{j,t}} p_{j,t}}. \qquad (1)$


Each term $p_{i,t}/q_{i,t}$ can be viewed as the probability of drawing $i$ from $p_t$ conditioned on the event that the loss of $i$ was observed. Similar to [11], a key aspect of our analysis is the ability to deterministically (and nonvacuously) upper bound $Q_t$ in terms of certain quantities defined on $\{S_{i,t}\}$. (An obvious upper bound on $Q_t$ is $K$.) We shall do so in two ways, either irrespective of how small each probability $p_{i,t}$ may be (this section) or depending on suitable lower bounds on the probabilities $p_{i,t}$ (Section 4). In fact, forcing lower bounds on $p_{i,t}$ is equivalent to adding exploration terms to the algorithm, which can be done only when knowing $\{S_{i,t}\}$ before each prediction, an information available only in the informed setting.
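Assuming the update is the standard exponential-weights rule with these importance-weighted estimates (our reading of the algorithm; the exact pseudocode is given in the paper as Algorithm Exp3-SET), one round can be sketched as:

```python
import math
import random

def exp3_set_round(w, eta, losses, obs_sets):
    """One round of an Exp3-SET sketch.  w: current weights (mutated in place);
    obs_sets[j]: set of actions observed when playing j (contains j).
    Returns the played action."""
    K = len(w)
    total = sum(w)
    p = [wi / total for wi in w]      # p_t proportional to weights, no mixing
    I = random.choices(range(K), weights=p)[0]
    for i in obs_sets[I]:             # only observed losses are estimated
        # q_i = sum of p_j over all j whose play would reveal i's loss
        q_i = sum(p[j] for j in range(K) if i in obs_sets[j])
        est = losses[i] / q_i         # importance-weighted loss estimate
        w[i] *= math.exp(-eta * est)  # exponential update
    return I
```

With `obs_sets[j] = {j}` this is Exp3 without mixing; with `obs_sets[j]` equal to the whole action set it is Hedge.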

The following simple result is the building block for all subsequent results in the uninformed setting. (All proofs are given in the appendix.)

Theorem 1

In the adversarial case, the regret of Exp3-SET satisfies

$\max_{k \in V} \mathbb{E}\big[L_{A,T} - L_{k,T}\big] \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \mathbb{E}[Q_t].$

As we said, in the adversarial and symmetric case the observation system at time $t$ can be described by an undirected graph $G_t$. This is essentially the problem of [11], which they studied in the easier informed setting, where the same quantity $Q_t$ above arises in the analysis of their ELP algorithm. In their Lemma 3, they show that $Q_t \le \alpha(G_t)$, irrespective of the choice of the probabilities $p_{i,t}$. When applied to Exp3-SET, this immediately gives the following result.

Corollary 2

In the adversarial and symmetric case, the regret of Exp3-SET satisfies

$\max_{k \in V} \mathbb{E}\big[L_{A,T} - L_{k,T}\big] \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \mathbb{E}[\alpha(G_t)].$

In particular, if for constants $\alpha_1, \dots, \alpha_T$ we have $\alpha(G_t) \le \alpha_t$, $t = 1, \dots, T$, then setting $\eta = \sqrt{(2 \ln K) / \sum_{t=1}^{T} \alpha_t}$ gives

$\max_{k \in V} \mathbb{E}\big[L_{A,T} - L_{k,T}\big] \le \sqrt{2 \ln K \sum_{t=1}^{T} \alpha_t}.$

As shown in [11], the knowledge of $\sum_t \alpha_t$ for tuning $\eta$ can be dispensed with (at the cost of extra log factors in the bound) by binning the values of $\eta$ and running Exp3 on top of a pool of instances of Exp3-SET, one for each bin. The bounds proven in Corollary 2 are equivalent to those proven in [11] (Theorem 2 therein) for the ELP algorithm. Yet, our analysis is much simpler and, more importantly, our algorithm is simpler and more efficient than ELP, which requires solving a linear program at each step. Moreover, unlike ELP, Exp3-SET does not require prior knowledge of the observation system at the beginning of each step.

We now turn to the directed setting. We first treat the random case, and then the harder adversarial case.

The Erdős–Rényi model is a standard model for random directed graphs $G = (V, D)$: given a density parameter $r \in [0, 1]$, each arc $(i, j)$ with $i \neq j$ is included in $D$ with independent probability $r$. (Self-loops, i.e., arcs $(i, i)$, are included by default here.) We have the following result.
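A sampler for this model is straightforward (a sketch; the convention that self-loops are always present follows the footnote above):

```python
import random

def erdos_renyi_digraph(K, r, rng=None):
    """Directed Erdős–Rényi graph on K nodes: each arc (i, j), i != j, is
    present with independent probability r; self-loops are always included."""
    rng = rng or random.Random(0)
    arcs = {(i, i) for i in range(K)}
    for i in range(K):
        for j in range(K):
            if i != j and rng.random() < r:
                arcs.add((i, j))
    return arcs
```

Setting r = 1 yields the complete graph (expert feedback), while r = 0 leaves only the self-loops (bandit feedback).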

Corollary 3

Let $G_t$ be generated according to the Erdős–Rényi model with parameter $r \in [0, 1]$. Then the regret of Exp3-SET satisfies

$\max_{k \in V} \mathbb{E}\big[L_{A,T} - L_{k,T}\big] \le \frac{\ln K}{\eta} + \frac{\eta T}{2r}\Big(1 - (1 - r)^K\Big).$

In the above, the expectations are w.r.t. both the algorithm’s randomization and the random generation of $G_t$ occurring at each round. In particular, setting $\eta = \sqrt{\frac{2 r \ln K}{T (1 - (1-r)^K)}}$ gives

$\max_{k \in V} \mathbb{E}\big[L_{A,T} - L_{k,T}\big] \le \sqrt{\frac{2 (\ln K)\, T \big(1 - (1-r)^K\big)}{r}}.$

Note that as $r$ ranges in $[0, 1]$ we interpolate between the bandit ($r \to 0$; observe that $\frac{1 - (1-r)^K}{r} \to K$ as $r \to 0$) and the expert ($r = 1$) regret bounds.

In the adversarial setting, we have the following result.

Corollary 4

In the adversarial and directed case, the regret of Exp3-SET satisfies

$\max_{k \in V} \mathbb{E}\big[L_{A,T} - L_{k,T}\big] \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \mathbb{E}[\mathrm{mas}(G_t)].$

In particular, if for constants $m_1, \dots, m_T$ we have $\mathrm{mas}(G_t) \le m_t$, $t = 1, \dots, T$, then setting $\eta = \sqrt{(2 \ln K) / \sum_{t=1}^{T} m_t}$ gives

$\max_{k \in V} \mathbb{E}\big[L_{A,T} - L_{k,T}\big] \le \sqrt{2 \ln K \sum_{t=1}^{T} m_t}.$

Observe that Corollary 4 is a strict generalization of Corollary 2 because, as we pointed out in Section 2, $\mathrm{mas}(G_t) \ge \alpha(G_t)$, with equality holding when $G_t$ is an undirected graph.

As far as lower bounds are concerned, in the symmetric setting the authors of [11] derive a lower bound of $\Omega(\sqrt{\alpha(G)\,T})$ in the case when $G_t = G$ for all $t$. We remark that, similarly to the symmetric setting, in the directed setting we can derive a lower bound of $\Omega(\sqrt{\alpha(G)\,T})$. The simple observation is that, given a directed graph $G$, we can define a new graph $G'$ which is made undirected just by reciprocating arcs; namely, if there is an arc $(i, j)$ in $G$ we add arcs $(i, j)$ and $(j, i)$ in $G'$. Note that $\alpha(G') = \alpha(G)$. Since in $G'$ the learner can only receive more information than in $G$, any lower bound on $G'$ also applies to $G$. Therefore we derive the following corollary to the lower bound of [11] (Theorem 4 therein).

Corollary 5

Fix a directed graph $G$, and suppose $G_t = G$ for all $t$. Then there exists a (randomized) adversarial strategy such that, for all sufficiently large $T$ and for any learning strategy, the expected regret of the learner is $\Omega(\sqrt{\alpha(G)\,T})$.

One may wonder whether a sharper lower bound argument exists which applies to the general directed setting and involves the larger quantity $\mathrm{mas}(G)$. Unfortunately, this quantity does not seem to be related to the optimal regret: using Claim 1 in the appendix (see proof of Theorem 3), one can exhibit a sequence of graphs, each having a large acyclic subgraph, on which the regret of Exp3-SET is still small.

The lack of a lower bound matching the upper bound provided by Corollary 4 is a good indication that something more sophisticated has to be done in order to upper bound $Q_t$ in (1). This leads us to consider more refined ways of allocating probabilities $p_{i,t}$ to nodes. However, this allocation will require prior knowledge of the graphs $G_t$.

4 Algorithms with Explicit Exploration: The Informed Setting

We are still in the general scenario where the graphs $G_t$ are arbitrary and directed, but now $G_t$ is made available before prediction. We start by showing a simple example where our analysis of Exp3-SET inherently fails. This is due to the fact that, when the graph induced by the observation system is directed, the key quantity $Q_t$ defined in (1) cannot be nonvacuously upper bounded independently of the choice of probabilities $p_{i,t}$. A way around this is to introduce a new algorithm, called Exp3-DOM, which controls the probabilities $p_{i,t}$ by adding an exploration term to the distribution $p_t$. This exploration term is supported on a dominating set of the current graph $G_t$. For this reason, Exp3-DOM requires prior access to a dominating set $R_t$ at each time step $t$, which, in turn, requires prior knowledge of the entire observation system $\{S_{i,t}\}$.

As announced, the next result shows that, even for simple directed graphs, there exist distributions $p$ on the vertices such that $Q$ is linear in the number of nodes while the independence number is $1$. (In this specific example, the maximum acyclic subgraph has size $K$, which confirms the looseness of Corollary 4.) Hence, nontrivial bounds on $Q_t$ can be found only by imposing conditions on the distribution $p_t$.

Fact 6

Let $G = (V, D)$ be a total order on $V = \{1, \dots, K\}$, i.e., such that arc $(j, i) \in D$ for all $i < j$. Let $p$ be a distribution on $V$ such that $p_i = 2^{-i}$ for $i < K$, and $p_K = 2^{-K+1}$. Then

$Q = \sum_{i \in V} \frac{p_i}{\sum_{j \ge i} p_j} = \frac{K + 1}{2}.$
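This construction can be checked numerically; the sketch below follows our reading of the statement (arcs point from each node to all smaller-indexed nodes, and p is a truncated geometric distribution with the last mass doubled so that the tail sums telescope exactly):

```python
def q_total(K):
    # p_i = 2^{-i} for i = 1, ..., K-1, and p_K = 2^{-(K-1)}.
    p = [2.0 ** -i for i in range(1, K)] + [2.0 ** -(K - 1)]
    # Node i is observed by playing any j >= i, so q_i is the tail sum of p.
    Q = 0.0
    for i in range(K):
        q_i = sum(p[i:])
        Q += p[i] / q_i
    return Q

# Every tail sum is an exact power of two, so each of the first K - 1
# ratios equals 1/2 and the last equals 1: Q grows linearly with K even
# though the independence number of the total order is 1.
```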

We are now ready to introduce and analyze the new algorithm Exp3-DOM for the adversarial, informed and directed setting. Exp3-DOM runs $\lfloor \log_2 K \rfloor + 1$ variants of Exp3 indexed by $b = 0, 1, \dots, \lfloor \log_2 K \rfloor$. At time $t$ the algorithm is given observation system $\{S_{i,t}\}$, and computes a dominating set $R_t$ of the directed graph $G_t$ induced by $\{S_{i,t}\}$. Based on the size $|R_t|$ of $R_t$, the algorithm uses instance $b_t = \lfloor \log_2 |R_t| \rfloor$ to pick action $I_t$. We use a superscript $b$ to denote the quantities relevant to the variant of Exp3 indexed by $b$. Similarly to the analysis of Exp3-SET, the key quantities are

Let $T_b = \{t : b_t = b\}$. Clearly, the sets $T_b$ are a partition of the time steps $\{1, \dots, T\}$, so that $\sum_b |T_b| = T$. Since the adversary adaptively chooses the dominating sets $R_t$, the sets $T_b$ are random. This causes a problem in tuning the parameters $\eta_b$. For this reason, we do not prove a regret bound for Exp3-DOM, where each instance uses a fixed $\eta_b$, but for a slight variant (described in the proof of Theorem 7; see the appendix) where each $\eta_b$ is set through a doubling trick.

Theorem 7

In the adversarial and directed case, the regret of Exp3-DOM satisfies


Moreover, if we use a doubling trick to choose $\eta_b$ for each instance $b$, then


Importantly, the next result shows how bound (3) of Theorem 7 can be expressed in terms of the sequence $\alpha_1, \dots, \alpha_T$ of independence numbers of the graphs $G_t$ whenever the Greedy Set Cover algorithm [7] (see Section 2) is used to compute the dominating set $R_t$ of the observation system at time $t$.

Corollary 8

If Step 2 of Exp3-DOM uses the Greedy Set Cover algorithm to compute the dominating sets $R_t$, then the regret of Exp3-DOM with doubling trick satisfies

where, for each $t$, $\alpha_t$ is the independence number of the graph $G_t$ induced by observation system $\{S_{i,t}\}$.

5 Conclusions and work in progress

We have investigated online prediction problems in partial information regimes that interpolate between the classical bandit and expert settings. We have shown a number of results characterizing prediction performance in terms of: the structure of the observation system, the amount of information available before prediction, and the nature (adversarial or fully random) of the process generating the observation system. Our results are substantial improvements over the paper [11] that initiated this interesting line of research. Our improvements are diverse, and range from considering both informed and uninformed settings, to delivering more refined graph-theoretic characterizations, to providing more efficient algorithmic solutions, to relying on simpler (and often more general) analytical tools.

Some research directions we are currently pursuing are the following.

  1. We are currently investigating the extent to which our results could be applied to the case when the observation system may depend on the loss of the player’s action $I_t$. Notice that this would prevent a direct construction of an unbiased estimator for unobserved losses, which many worst-case bandit algorithms (including ours; see the appendix) hinge upon.

  2. The upper bound contained in Corollary 4 and expressed in terms of $\mathrm{mas}(G_t)$ is almost certainly suboptimal, even in the uninformed setting, and we are trying to see if more adequate graph complexity measures can be used instead.

  3. Our lower bound (Corollary 5) heavily relies on the corresponding lower bound in [11] which, in turn, refers to a constant graph sequence. We would like to provide a more complete characterization applying to sequences of adversarially-generated graphs in terms of sequences of their corresponding independence numbers (or variants thereof), in both the uninformed and the informed settings.


The first author was supported in part by an ERC advanced grant, by a USA-Israeli BSF grant, and by the Israeli I-CORE program. The second author acknowledges partial support by MIUR (project ARS TechnoMedia, PRIN 2010-2011, grant no. 2010N5K7EB_003). The fourth author was supported in part by a grant from the Israel Science Foundation, a grant from the United States-Israel Binational Science Foundation (BSF), a grant by Israel Ministry of Science and Technology and the Israeli Centers of Research Excellence (I-CORE) program (Center No. 4/11).


  • [1] N. Alon and J. H. Spencer. The probabilistic method. John Wiley & Sons, 2004.
  • [2] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, 2009.
  • [3] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
  • [4] Y. Caro. New results on the independence number. Tech. Report, Tel-Aviv University, 1979.
  • [5] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. J. ACM, 44(3):427–485, 1997.
  • [6] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
  • [7] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
  • [8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Euro-COLT, pages 23–37. Springer-Verlag, 1995. Also, JCSS 55(1): 119-139 (1997).
  • [9] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307, 2005.
  • [10] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
  • [11] S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. In 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), 2011.
  • [12] Alan Said, Ernesto W De Luca, and Sahin Albayrak. How social relationships affect user similarities. In Proceedings of the International Conference on Intelligent User Interfaces Workshop on Social Recommender Systems, Hong Kong, 2010.
  • [13] V. G. Vovk. Aggregating strategies. In COLT, pages 371–386, 1990.
  • [14] V. K. Wei. A lower bound on the stability number of a simple graph. Bell Lab. Tech. Memo No. 81-11217-9, 1981.

Appendix A Technical lemmas and proofs

This section contains the proofs of all technical results occurring in the main text, along with ancillary graph-theoretic lemmas. Throughout this appendix, $\mathbb{E}_t[\cdot]$ is a shorthand for the conditional expectation $\mathbb{E}[\,\cdot \mid I_1, \dots, I_{t-1}]$.

Proof of Theorem 1. Following the proof of Exp3 [3], we have

Taking logs, using $e^{-x} \le 1 - x + x^2/2$ for all $x \ge 0$, and summing over $t$ yields

Moreover, for any fixed comparison action $k$, we also have

Putting together and rearranging gives


Note that, for all $i$ and $t$,


Hence, taking expectations on both sides of (4), and recalling the definition (1) of $Q_t$, we can write


Finally, taking expectations to remove conditioning gives

as claimed.

Proof of Corollary 3. Fix round $t$, and let $G_t = (V, D_t)$ be the Erdős–Rényi random graph generated at time $t$, let $N_i^- \subseteq V$ be the in-neighborhood of node $i$, i.e., the set of nodes $j$ such that $(j, i) \in D_t$, and denote by $d_i^- = |N_i^-|$ the indegree of $i$.

Claim 1

Let $p = (p_1, \dots, p_K)$ be an arbitrary probability distribution defined over $V$, let $\sigma$ be an arbitrary permutation of $V$, and denote by $\mathbb{E}_\sigma$ the expectation w.r.t. permutation $\sigma$ when $\sigma$ is drawn uniformly at random. Then, for any fixed $i \in V$, we have

Proof of Claim 1. Consider selecting a subset of nodes. We shall consider the contribution to the expectation when . Since there are terms (out of ) contributing to the expectation, we can write

Claim 2

Let $p$ be an arbitrary probability distribution defined over $V$, and denote by $\mathbb{E}_t$ the expectation w.r.t. the Erdős–Rényi random draw of arcs at time $t$. Then, for any fixed $i \in V$, we have

Proof of Claim 2. For the given $i$ and time $t$, consider the Bernoulli random variables indicating the presence of each arc $(j, i)$, and denote by $\mathbb{E}$ the expectation w.r.t. all of them. We symmetrize by means of a random permutation $\sigma$, as in Claim 1. We can write

At this point, we follow the proof of Theorem 1 up until (5). We take an expectation w.r.t. the randomness in generating the sequence of graphs $G_1, \dots, G_T$. This yields

We use Claim 2 to upper bound the conditional expectation of each $Q_t$, and take the outer expectation to remove conditioning, as in the proof of Theorem 1. This concludes the proof.

The following lemma can be seen as a generalization of Lemma 3 in [11].

Lemma 9

Let $G = (V, D)$ be a directed graph with vertex set $V = \{1, \dots, K\}$ and arc set $D$. Let $N_i^-$ be the in-neighborhood of node $i$, i.e., the set of nodes $j$ such that $(j, i) \in D$. Then, for any probability distribution $p$ over $V$,

$\sum_{i \in V} \frac{p_i}{p_i + \sum_{j \in N_i^-} p_j} \le \mathrm{mas}(G).$

Proof. We will show that there is a subset $V'$ of vertices such that the graph induced by $V'$ is acyclic and $\sum_{i \in V} \frac{p_i}{p_i + \sum_{j \in N_i^-} p_j} \le |V'|$.

We prove the lemma by growing the set $V'$, starting off from $V' = \emptyset$. Let $v$ be the vertex which minimizes $p_v + \sum_{j \in N_v^-} p_j$ over $v \in V$. We are going to delete $v$ from the graph, along with all its incoming neighbors (the set $N_v^-$) and all edges incident (both departing and incoming) to these nodes, and then iterate on the remaining graph. Let us denote the in-neighborhoods in the graph shrunken by this first step by $N_i^{-,(1)}$.

The contribution of all the deleted vertices to the sum in the lemma statement is

where the inequality follows from the minimality of $v$.

Let , and . Then from the first step we have

We apply the very same argument to each subsequent shrunken graph with its own minimizing node, up until no nodes are left. This gives the claimed bound, with $V'$ collecting the minimizing nodes selected along the way. Moreover, since in each step we remove all remaining arcs incoming to the selected node, the graph induced by the set $V'$ cannot contain cycles.
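The elimination procedure can be sketched in code (our reading of the proof: repeatedly keep a node minimizing its own probability mass plus that of its in-neighbors, then delete it together with its in-neighbors; the kept nodes induce an acyclic subgraph whose size upper bounds the sum):

```python
def acyclic_bound(p, arcs):
    """Greedy elimination sketched from the proof of Lemma 9.
    Returns (phi, kept): phi is sum_i p_i / (p_i + mass of i's in-neighbors)
    computed on the original graph, and kept is the acyclic set V' grown by
    the procedure, so that phi <= len(kept)."""
    def denom(i, alive):
        # p_i plus the total mass of i's still-alive in-neighbors
        return p[i] + sum(p[j] for (j, k) in arcs if k == i and j != i and j in alive)

    V = set(range(len(p)))
    phi = sum(p[i] / denom(i, V) for i in V)
    kept = []
    while V:
        v = min(V, key=lambda i: denom(i, V))   # minimizing node
        kept.append(v)
        in_nbrs = {j for (j, k) in arcs if k == v and j != v and j in V}
        V -= in_nbrs | {v}                      # delete v and its in-neighbors
    return phi, kept
```

Since every arc into a kept node is removed before later nodes are kept, arcs among kept nodes only point from earlier-kept to later-kept ones, so the induced subgraph is acyclic.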

Proof of Corollary 4. The claim follows from a direct combination of Theorem 1 with Lemma 9.

Proof of Fact 6. Using standard properties of geometric sums, one can immediately see that each tail sum $\sum_{j \ge i} p_j$ equals $2^{-i+1}$ for $i < K$, so that each of the first $K - 1$ terms of $Q$ equals $1/2$ while the last term equals $1$; hence the claimed result.

The following graph-theoretic lemma turns out to be fairly useful for analyzing directed settings. It is a directed-graph counterpart to a well-known result [4, 14] holding for undirected graphs.

Lemma 10

Let $G = (V, D)$ be a directed graph, with $|V| = K$. Let $d_i^-$ be the indegree of node $i$, and $\alpha = \alpha(G)$ be the independence number of $G$. Then

$\sum_{i \in V} \frac{1}{1 + d_i^-} \le 2\alpha \ln\left(1 + \frac{K}{\alpha}\right).$

Proof. We proceed by induction, starting off from the original $K$-node graph with indegrees $d_i^-$ and independence number $\alpha$, and then progressively shrinking it by eliminating nodes and incident (both departing and incoming) arcs, thereby obtaining a sequence of smaller and smaller graphs with associated indegrees and independence numbers. Specifically, in each step we sort the nodes of the current graph in nonincreasing order of indegree, and obtain the next graph by eliminating a node having the largest indegree, along with its incident arcs. On all such graphs, we will use the classical Turán theorem (e.g., [1]) stating that any undirected graph with $n$ nodes and $m$ edges has an independent set of size at least