Path Planning Problems with Side Observations: When Colonels Play Hide-and-Seek

Resource allocation games such as the famous Colonel Blotto (CB) and Hide-and-Seek (HS) games are often used to model a large variety of practical problems, but only in their one-shot versions. Indeed, due to their extremely large strategy spaces, it remains an open question how one can efficiently learn in these games. In this work, we show that the online CB and HS games can be cast as path planning problems with side-observations (SOPPP): at each stage, a learner chooses a path on a directed acyclic graph and suffers the sum of the losses that are adversarially assigned to the corresponding edges; she then receives semi-bandit feedback with side-observations (i.e., she observes the losses on the chosen edges plus some others). We propose a novel algorithm, Exp3-OE, the first of its kind with guaranteed efficient running time for SOPPP without requiring any auxiliary oracle. We provide an expected-regret bound for Exp3-OE in SOPPP that matches the order of the best benchmark in the literature. Moreover, we introduce additional assumptions on the observability model under which we can further improve the regret bounds of Exp3-OE. We illustrate the benefit of using Exp3-OE in SOPPP by applying it to the online CB and HS games.


1 Introduction

Resource allocation games have been studied extensively in the literature and have been shown to be very useful for modeling many practical situations, including online decision problems; see, e.g., [7, 9, 20, 35]. Two of the most renowned are the Colonel Blotto game (henceforth, CB game) and the Hide-and-Seek game (henceforth, HS game). In the (one-shot) CB game, two players, each with a fixed budget, simultaneously allocate their (indivisible) resources across battlefields; each player’s payoff is the aggregate of the values of the battlefields where she has the higher allocation. The scope of applications of CB games includes a variety of problems: in security, where resources correspond to security forces (e.g., [12, 27]); in politics, where budgets are distributed to attract voters (e.g., [21, 25]); and in advertising, for distributing the ads’ broadcasting time (e.g., [23, 24]). In the (one-shot) HS game, on the other hand, a seeker chooses some of the available locations to search for a hider, who chooses the probability of hiding in each location. The seeker’s payoff is the sum of the probabilities that the hider hides in the chosen locations, and the hider’s payoff is the probability that she successfully escapes the seeker’s pursuit. Several variants of the HS game are used to model surveillance situations [6], anti-jamming problems [32], vehicle control [29], etc.

Both the CB and the HS games have a long-standing history (originated by [8] and [30], respectively); however, the results achieved so far in these games are mostly limited to their one-shot and full-information versions (see, e.g., [5, 15, 25, 27, 31] for CB games and [17, 33] for HS games). In contrast, in most applications (e.g., telecommunications, web security, advertising), a more natural setting is one where the game is played repeatedly and players have access only to incomplete information at each stage. In this setting, players are often required to sequentially learn the game on the fly and to balance the trade-off between exploiting known information and exploring to gain new information. Thus, this work focuses on the following sequential learning problems:

The online CB game: the learner’s budget and the number of battlefields are fixed; at each stage, the learner plays a CB game with this budget against some adversaries across the battlefields; at the end of the stage, she receives limited feedback, namely the gain (loss) she obtains from each battlefield (but not the adversaries’ strategies). The battlefields’ values can change over time and are unknown to the learner when she makes her decision at each stage. This setting is generic and covers many applications of the CB game. For instance, in the radio resource allocation problem (in a cognitive radio network), a solution that balances efficiency and fairness is to give the users fictional budgets (the same budget at each stage) and let them bid across spectrum carriers simultaneously to compete for as many bandwidth portions as possible; the highest bidder on each carrier wins the corresponding bandwidth (see, e.g., [13]). At the end of each stage, each user observes her own data rate (the gain/loss) achieved via each carrier (corresponding to the battlefields’ values) but does not know the other users’ bids. Note that the actual data rate can be noisy and change over time. Moreover, users can enter and leave the system, so no stochastic assumption should be made on the adversaries’ decisions.

The online HS game: the number of locations and the number of locations searched per stage are fixed; at each stage, the learner is a seeker who plays the same HS game against an adversary; at the end of the stage, the seeker observes only the gains/losses she suffers from the locations she chose. This setting is practical; one motivating example is the spectrum sensing problem in the opportunistic spectrum access context (see, e.g., [34]). At each stage, a secondary user (the learner) sends a sensing signal to a limited number of the channels (due to energy constraints, she cannot sense all channels) with the objective of sensing the channels with the highest availability. The learner can only measure the reliability (the gain/loss) of the channels that she sensed. Note that the channels’ availability depends on the primary users’ decisions, which are non-stochastic.

A formal definition of these problems is given in Section 4; hereinafter, we reuse the terms CB game and HS game to refer to these sequential learning versions of the games. The main challenge here is that the strategy space is exponential in the natural parameters (e.g., the number of troops and battlefields in the CB game, the number of locations in the HS game); hence, how to learn efficiently in these games is an open question.

Our first contribution towards solving this open question is to show that the CB and HS games can be cast as a Path Planning Problem (henceforth, PPP), one of the most well-studied instances of the Online Combinatorial Optimization framework (henceforth, OComb; see [11] for a survey). In PPPs, given a graph, at each stage, a learner chooses a path; then a loss is adversarially chosen for each edge, and the learner suffers the sum of the losses of the edges belonging to her chosen path. The learner’s goal is to minimize regret. The information that the learner receives in the CB and HS games as described above corresponds straightforwardly to the so-called semi-bandit feedback setting of PPPs, i.e., at the end of each stage, the learner observes the losses of the edges belonging to her chosen path (see Section 4 for more details). However, the specific structure of the considered games also allows the learner to deduce (without any extra cost) from the semi-bandit feedback the losses of some other edges that may not belong to the chosen path; these are called side-observations. Henceforth, we will use the term SOPPP to refer to this PPP under semi-bandit feedback with side-observations.

SOPPP is a special case of OComb with side-observations (henceforth, SOComb) studied by [19] and, following their approach, we will use observation graphs (defined in Section 2) to capture the learner’s observability. (The observation graphs, proposed by [19] and used here for SOPPP, extend the side-observations model for multi-armed bandit problems studied by [1, 2, 22]: they capture side-observations between edges, whereas the side-observations model considered by [1, 2, 22] is between actions, i.e., paths in PPPs.) [19] focuses on the class of Follow-the-Perturbed-Leader (FPL) algorithms (originating from [18]) and proposes an algorithm named FPL-IX for SOComb, which could be applied directly to SOPPP. However, this faces two main problems: (i) the efficiency of FPL-IX is only guaranteed with high probability (as it depends on the geometric sampling technique) and it is still super-linear in terms of the time horizon, so there is room for improvement; (ii) FPL-IX requires an efficient oracle that solves an optimization problem at each stage. Both of these issues are incompatible with our goal of learning in the CB and HS games: although the probability that FPL-IX fails to terminate is small, this could cause issues in practical implementations where the learner is obliged to give a decision quickly at each stage; and it is unclear which oracle should be used when applying FPL-IX to the CB and HS games.

In this paper, we focus instead on another prominent class of OComb algorithms, called Exp3 [4, 14]. One of the key open questions in this field is how to design a variant of Exp3 with efficient running time and good regret guarantees for OComb problems in each feedback setting (see, e.g., [10]). Our second contribution is to propose an Exp3-type algorithm for SOPPPs that solves both of the aforementioned issues of FPL-IX and provides good regret guarantees; i.e., we give an affirmative answer to an important subset of the above-mentioned open problem. In more detail, this contribution is three-fold: (i) we propose a novel algorithm, Exp3-OE, that is applicable to any instance of SOPPP; importantly, Exp3-OE is always guaranteed to run efficiently (i.e., in polynomial time in terms of the number of edges of the graph in SOPPP) without the need for any auxiliary oracle; (ii) we prove that Exp3-OE guarantees an upper bound on the expected regret matching in order the best benchmark in the literature (the FPL-IX algorithm), and we prove further improvements under additional assumptions on the observation graphs that have so far been ignored in the literature; (iii) we demonstrate the benefit of using the Exp3-OE algorithm in the CB and HS games.

Note importantly that the SOPPP model (and the Exp3-OE algorithm) can be applied to many problems beyond the CB and HS games, e.g., auctions and recommendation systems. To highlight this and for the sake of conciseness, we first study the generic SOPPP model in Section 2 and present our second contribution, i.e., the Exp3-OE algorithm for SOPPPs, in Section 3; we defer the formal definition of the CB and HS games, together with the analysis of running Exp3-OE in these games (i.e., our first contribution), to Section 4.

Throughout the paper, we use bold symbols to denote vectors. For any positive integer $m$, the set $\{1, 2, \dots, m\}$ is denoted by $[m]$, and the indicator function of a set is denoted by $\mathbb{I}$. For graphs, we write either $e \in p$ or $p \ni e$ to indicate that an edge $e$ belongs to a path $p$. Finally, we use $\tilde{O}$ as a version of the big-O asymptotic notation that ignores logarithmic terms.

2 Path Planning Problems with Side-Observations (SOPPP) Formulation

As discussed in Section 1, motivated by the CB and HS games, we propose the path planning problem with semi-bandit and side-observations feedback (SOPPP).

SOPPP model. Consider a directed acyclic graph (henceforth, DAG) with a set of vertices and a set of edges $\mathcal{E}$; let $E := |\mathcal{E}|$ denote the number of edges. There are two special vertices, a source and a destination. We denote by $\mathcal{P}$ the set of all paths starting from the source and ending at the destination, and let $P := |\mathcal{P}|$. Each path $p \in \mathcal{P}$ corresponds to a binary vector indexed by the edges, whose entry for an edge $e$ equals 1 if and only if $e$ belongs to $p$. Let $n$ be the length of the longest path in $\mathcal{P}$. Given a time horizon $T$, at each (discrete) stage $t \in [T]$, a learner chooses a path $\tilde{p}_t \in \mathcal{P}$. Then, a loss vector $\ell_t$ is secretly and adversarially chosen; each element $\ell_t(e)$ is the scalar loss on the edge $e \in \mathcal{E}$. We consider a non-oblivious adversary, i.e., $\ell_t$ can be an arbitrary function of the learner’s past actions, but not of her current action $\tilde{p}_t$ (this setting is considered by most works in the non-stochastic/adversarial bandits literature, e.g., [2, 10]). The learner’s incurred loss is $\sum_{e \in \tilde{p}_t} \ell_t(e)$, i.e., the sum of the losses of the edges belonging to $\tilde{p}_t$. The learner’s feedback at stage $t$ after choosing $\tilde{p}_t$ is as follows. First, she receives semi-bandit feedback, that is, she observes the losses $\ell_t(e)$ for every edge $e$ belonging to the chosen path $\tilde{p}_t$. Additionally, each edge may reveal the losses on several other edges. To represent these side-observations at stage $t$, we consider an observation graph with $E$ vertices, each corresponding to an edge of the DAG. There is a directed edge from the vertex of $e'$ to the vertex of $e$ in the observation graph if, by observing the loss $\ell_t(e')$, the learner can also deduce the loss $\ell_t(e)$; we denote this by $e' \to e$ and say that the edge $e'$ reveals the edge $e$. The objective of the learner is to minimize the cumulative expected regret $R_T$, that is, her expected cumulative loss over the $T$ stages minus the cumulative loss of the best fixed path in hindsight.

Hereinafter, where there is no ambiguity, we use the term path to refer to a source-to-destination path in the DAG and the term observation graphs to refer to the observation graphs of all stages. In general, these observation graphs can depend on the decisions of both the learner and the adversary. All vertices in an observation graph always have self-loops. If none of the observation graphs contains any edge other than these self-loops, no side-observation is available and the problem reduces to the classical semi-bandit setting. If all observation graphs are complete graphs, SOPPP corresponds to the full-information PPP. In this work, we focus on the uninformed setting, i.e., the learner observes the observation graph of a stage only after making her decision at that stage. Let us also introduce two new notations:

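$$O_t(e) := \{p \in \mathcal{P} : \exists e' \in p,\ e' \to e\},\ \forall e \in \mathcal{E}, \qquad O_t(p) := \{e \in \mathcal{E} : \exists e' \in p,\ e' \to e\},\ \forall p \in \mathcal{P}.$$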

Intuitively, $O_t(e)$ is the set of all paths that, if chosen, reveal the loss on the edge $e$, and $O_t(p)$ is the set of all edges whose losses are revealed if the path $p$ is chosen. Trivially, $p \in O_t(e)$ if and only if $e \in O_t(p)$. Moreover, due to the semi-bandit feedback, if $e \in p$, then $p \in O_t(e)$ and $e \in O_t(p)$. Apart from the results for general observation graphs, in this work, we additionally present several results under two particular assumptions, satisfied by some instances in practice (e.g., the CB and HS games), that provide more refined regret bounds than the cases considered by [19]: (i) symmetric observation graphs, where for each edge from $e'$ to $e$ there also exists an edge from $e$ to $e'$ (i.e., if $e' \to e$ then $e \to e'$), so that the observation graph is undirected; (ii) observation graphs satisfying the following assumption, which requires that if two edges belong to the same path, then they cannot simultaneously reveal the loss of another edge.

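Assumption: For any edge $e \in \mathcal{E}$, if two distinct edges $e'$ and $e''$ satisfy $e' \to e$ and $e'' \to e$, then no path in $\mathcal{P}$ contains both $e'$ and $e''$.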
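To make the two sets $O_t(e)$ and $O_t(p)$ concrete, the following is a minimal sketch on a toy four-edge DAG. The graph, the single side-observation, and all function names are hypothetical and chosen only for illustration; paths are enumerated explicitly, which is feasible only for tiny graphs.

```python
def all_paths(edges, source, dest):
    """Enumerate every source-to-dest path as a tuple of edges (toy graphs only)."""
    successors = {}
    for tail, head in edges:
        successors.setdefault(tail, []).append((tail, head))
    paths = []

    def walk(vertex, prefix):
        if vertex == dest:
            paths.append(tuple(prefix))
            return
        for edge in successors.get(vertex, []):
            walk(edge[1], prefix + [edge])

    walk(source, [])
    return paths


def with_self_loops(reveals, edges):
    """Every edge reveals itself (this encodes the semi-bandit feedback)."""
    return {e: set(reveals.get(e, ())) | {e} for e in edges}


def paths_revealing(edge, paths, reveals):
    """O_t(e): paths containing at least one edge e' with e' -> e."""
    return [p for p in paths if any(edge in reveals[e2] for e2 in p)]


def edges_revealed_by(path, reveals):
    """O_t(p): edges whose loss is revealed when path p is played."""
    revealed = set()
    for e2 in path:
        revealed |= reveals[e2]
    return revealed


def is_symmetric(reveals):
    """True if e1 -> e2 always implies e2 -> e1."""
    return all(e1 in reveals[e2] for e1 in reveals for e2 in reveals[e1])


EDGES = [("s", "a"), ("s", "b"), ("a", "d"), ("b", "d")]
PATHS = all_paths(EDGES, "s", "d")
# Hypothetical side-observation: edge (s, a) also reveals edge (s, b).
REVEALS = with_self_loops({("s", "a"): {("s", "b")}}, EDGES)
```

In this toy instance, both paths reveal the edge `("s", "b")`, while only the path through `a` reveals `("a", "d")`; the observation relation is not symmetric because `("s", "b")` does not reveal `("s", "a")`.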

3 Exp3-OE - An Efficient Algorithm for the SOPPP

In this section, we present a new algorithm for SOPPP, called Exp3-OE (OE stands for Observable Edges), whose pseudo-code is given by Algorithm 3. The guarantees on the expected regret of Exp3-OE in SOPPP are analyzed in Section 3.2. Moreover, Exp3-OE always runs efficiently, in time polynomial in the number of edges of the graph; this is discussed in Section 3.1.

As an Exp3-type algorithm, Exp3-OE relies on average-weights sampling, where at each stage $t$ we update the weight $w_t(e)$ on each edge $e$ by the exponential rule. For each path $p$, we denote the path weight by $w_t(p) := \prod_{e \in p} w_t(e)$ and define the following terms:

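$$x_t(p) := \frac{\prod_{e \in p} w_t(e)}{\sum_{p' \in \mathcal{P}} \prod_{e' \in p'} w_t(e')} = \frac{w_t(p)}{\sum_{p' \in \mathcal{P}} w_t(p')}, \quad \forall p \in \mathcal{P}. \tag{1}$$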

Line 5 of Exp3-OE involves a sub-algorithm, called the WPS algorithm, that samples a path $p$ with probability $x_t(p)$ (the sampled path is then denoted by $\tilde{p}_t$) from any input edge weights at each stage $t$. This algorithm is based on a classical technique called weight pushing (see, e.g., [28, 16]). We discuss further details and present an explicit formulation of the WPS algorithm in Appendix A.

Compared to other instances of Exp3-type algorithms, Exp3-OE has two major differences. First, at each stage $t$, the loss of each edge $e$ is estimated by $\hat{\ell}_t(e)$ based on the term $q_t(e)$ and a parameter $\beta$. Intuitively, $q_t(e)$ is the probability that the loss on the edge $e$ is revealed by playing the chosen path at stage $t$. The implicit exploration parameter $\beta$ added to the denominator allows us to “pretend to explore” in Exp3-OE without knowing the observation graph before making the decision at stage $t$ (the uninformed setting). Unlike in the standard Exp3, the loss estimator used in Exp3-OE is biased, i.e., for any edge $e$,

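$$\mathbb{E}_t\big[\hat{\ell}_t(e)\big] = \sum_{\tilde{p} \in \mathcal{P}} x_t(\tilde{p}) \frac{\ell_t(e)}{q_t(e) + \beta} \mathbb{I}_{\{e \in O_t(\tilde{p})\}} = \sum_{\tilde{p} \in O_t(e)} x_t(\tilde{p}) \frac{\ell_t(e)}{\sum_{p \in O_t(e)} x_t(p) + \beta} \le \ell_t(e). \tag{2}$$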

Here, $\mathbb{E}_t$ denotes the expectation w.r.t. the randomness of choosing a path at stage $t$. Second, unlike standard Exp3 algorithms that keep track of and update the weight of each path, the weight pushing technique is applied when sampling the path (via the WPS algorithm) and when computing $q_t(e)$ (via Algorithm 3.1 in Section 3.1), where we work with edge weights instead of path weights (recall that $w_t(p) = \prod_{e \in p} w_t(e)$).
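The bias in Eq. (2) can be checked numerically on a small instance. The sketch below assumes a hypothetical two-path distribution and observation relation (self-loops included), and computes $q_t(e)$ and the expectation of the estimator by direct summation over paths; it illustrates the formula only and is not the efficient computation used by the algorithm.

```python
def reveal_probability(edge, x, reveals):
    """q_t(e): total probability of the paths that reveal the loss on `edge`."""
    return sum(prob for path, prob in x.items()
               if any(edge in reveals[e2] for e2 in path))


def expected_estimate(edge, x, reveals, loss, beta):
    """E_t[estimated loss of `edge`] as in Eq. (2), summing over the sampled path."""
    q = reveal_probability(edge, x, reveals)
    total = 0.0
    for path, prob in x.items():
        revealed = any(edge in reveals[e2] for e2 in path)
        # The estimate is loss / (q + beta) when the loss is revealed, else 0.
        estimate = loss / (q + beta) if revealed else 0.0
        total += prob * estimate
    return total


SA, SB, AD, BD = ("s", "a"), ("s", "b"), ("a", "d"), ("b", "d")
# Hypothetical path distribution x_t over the two paths of a toy DAG.
X = {(SA, AD): 0.25, (SB, BD): 0.75}
# Hypothetical observation relation: SA reveals SB; every edge reveals itself.
REVEALS = {SA: {SA, SB}, SB: {SB}, AD: {AD}, BD: {BD}}
```

With `beta = 0` the estimator is unbiased for every edge with positive reveal probability, and any `beta > 0` shrinks the expectation below the true loss, which is the inequality in Eq. (2).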

3.1 Running Time Efficiency of the Exp3-OE Algorithm

In the WPS algorithm mentioned above, one needs to compute, for every vertex $u$ of the graph, the aggregate weight at stage $t$ of all paths from the source to $u$ and of all paths from $u$ to the destination (the weight of a path being the product of the weights of its edges). These terms can be computed recursively and efficiently by dynamic programming. This computation is often referred to as weight pushing. Following the literature, we present in Appendix A an explicit algorithm, called the WP algorithm, that outputs these terms from any input edge weights. Then, a path is sampled sequentially, edge by edge, based on these terms by the WPS algorithm. Importantly, the WP and WPS algorithms run efficiently, in time polynomial in the number of edges.
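As an illustration of weight pushing, here is a minimal sketch of the generic technique; it is not the WP/WPS pseudo-code of Appendix A, and the function names and toy weights are hypothetical. The aggregate weights of the paths from each vertex to the destination are computed by memoized recursion, and a path is then drawn edge by edge so that its overall probability is the product of its edge weights divided by the total weight of all paths.

```python
import random


def suffix_weights(edges, weights, dest):
    """H[v]: sum over all v-to-dest paths of the product of edge weights."""
    successors = {}
    for tail, head in edges:
        successors.setdefault(tail, []).append((tail, head))
    H = {dest: 1.0}

    def compute(v):
        if v not in H:
            H[v] = sum(weights[e] * compute(e[1]) for e in successors.get(v, []))
        return H[v]

    for tail, _ in edges:
        compute(tail)
    return H, successors


def sample_path(edges, weights, source, dest, rng):
    """Draw a path edge by edge; path p is returned with probability w(p) / H[source]."""
    H, successors = suffix_weights(edges, weights, dest)
    path, v = [], source
    while v != dest:
        # Skip edges leading to vertices from which dest is unreachable.
        candidates = [e for e in successors[v] if H[e[1]] > 0]
        probs = [weights[e] * H[e[1]] / H[v] for e in candidates]
        edge = rng.choices(candidates, weights=probs, k=1)[0]
        path.append(edge)
        v = edge[1]
    return tuple(path)


EDGES = [("s", "a"), ("s", "b"), ("a", "d"), ("b", "d")]
# Hypothetical edge weights: the path via a has weight 3, the path via b has weight 1.
WEIGHTS = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "d"): 3.0, ("b", "d"): 0.5}
```

The telescoping product of the per-step probabilities `weights[e] * H[head] / H[tail]` along a path leaves exactly the path weight over `H[source]`, so no path is ever enumerated. For clarity this sketch recomputes `H` on every draw; an implementation would compute it once per stage.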

The final non-trivial step to efficiently implement Exp3-OE is to compute $q_t(e)$, i.e., the probability that the loss of an edge $e$ is revealed at stage $t$. Note that $q_t(e)$ is the sum of $|O_t(e)|$ terms; therefore, a direct computation is inefficient, while a naive application of the weight pushing technique can easily lead to errors. To compute $q_t(e)$, we propose Algorithm 3.1, a non-straightforward application of weight pushing, in which we consecutively consider all the edges $e'$ that reveal $e$. We take the sum of the terms $x_t(p)$ of the paths $p$ going through $e'$ by the weight pushing technique, while making sure that each of these terms is included only once, even if $p$ has more than one edge revealing $e$ (this is a non-trivial step). In Algorithm 3.1, we also use the set of direct successors of each vertex. We give a proof in Appendix B that Algorithm 3.1 outputs exactly $q_t(e)$ as defined in Algorithm 3. The running time of Algorithm 3.1, and hence of this step of Algorithm 3, is polynomial in the number of edges.

In conclusion, the running time of Exp3-OE is polynomially bounded, and this guarantee holds even in the worst case. For comparison, the running-time bounds of the FPL-IX algorithm hold only in expectation and with high probability (if one runs FPL-IX with Dijkstra’s algorithm as the optimization oracle and with the parameters chosen by [19]). That is, FPL-IX might fail to terminate with a strictly positive probability, and it is not guaranteed to have efficient running time in all cases. (A stopping criterion for FPL-IX can be chosen to avoid this issue, but it raises the question of how to choose the criterion such that the regret guarantees still hold.) Moreover, although the complexity bound of FPL-IX is slightly better in terms of the other parameters, the complexity bound of Exp3-OE is better in terms of the time horizon $T$. As is often the case in no-regret analysis, we consider the setting where $T$ is significantly larger than the other parameters of the problem; this is also consistent with the motivating applications of the CB and HS games presented in Section 1. Therefore, our contribution in improving the algorithm’s running time in terms of $T$ is relevant.

3.2 Performance of the Exp3-OE Algorithm

In this section, we present an upper bound on the expected regret achieved by the Exp3-OE algorithm in the SOPPP. For the sake of brevity, with $x_t(p)$ defined in (1), for any $t \in [T]$ and $e \in \mathcal{E}$, we denote:

$$r_t(e) := \sum_{p \ni e} x_t(p) \quad \text{and} \quad Q_t := \sum_{e \in \mathcal{E}} \frac{r_t(e)}{q_t(e) + \beta}.$$

Intuitively, $r_t(e)$ is the probability that the chosen path at stage $t$ contains the edge $e$, and $Q_t$ is the sum over all edges of the ratio between this quantity and the probability that the loss of the edge is revealed (plus $\beta$). We can bound the expected regret with this key term $Q_t$.
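The quantities $r_t(e)$ and $Q_t$ can be illustrated on a toy graph. The sketch below (hypothetical names and weights) computes $r_t(e)$ without enumerating paths, by combining the aggregate weight of the paths from the source to the tail of $e$ with that of the paths from the head of $e$ to the destination, and then evaluates $Q_t$ for a given vector of reveal probabilities. The values of $q_t(e)$ are passed in as an input; computing them efficiently in general is the role of Algorithm 3.1 and is not attempted here.

```python
def prefix_suffix_weights(edges, weights, source, dest):
    """F[v]: aggregate weight of source-to-v paths; H[v]: aggregate weight of v-to-dest paths."""
    succ, pred = {}, {}
    for e in edges:
        succ.setdefault(e[0], []).append(e)
        pred.setdefault(e[1], []).append(e)
    F, H = {source: 1.0}, {dest: 1.0}

    def forward(v):
        if v not in F:
            F[v] = sum(forward(e[0]) * weights[e] for e in pred.get(v, []))
        return F[v]

    def backward(v):
        if v not in H:
            H[v] = sum(weights[e] * backward(e[1]) for e in succ.get(v, []))
        return H[v]

    for tail, head in edges:
        forward(head)
        backward(tail)
    return F, H


def edge_marginals(edges, weights, source, dest):
    """r_t(e): probability that the sampled path contains edge e."""
    F, H = prefix_suffix_weights(edges, weights, source, dest)
    total = H[source]
    return {e: F[e[0]] * weights[e] * H[e[1]] / total for e in edges}


def q_term(r, q, beta):
    """Q_t = sum over edges of r_t(e) / (q_t(e) + beta)."""
    return sum(r[e] / (q[e] + beta) for e in r)


EDGES = [("s", "a"), ("s", "b"), ("a", "d"), ("b", "d")]
# Hypothetical edge weights: the path via a has weight 3, the path via b has weight 1.
WEIGHTS = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "d"): 3.0, ("b", "d"): 0.5}
```

Two sanity checks follow from the definitions: with only self-loops in the observation graph (semi-bandit feedback), $q_t(e) = r_t(e)$, so with $\beta = 0$ the term $Q_t$ equals the number of edges with positive marginal; with a complete observation graph (full information), $q_t(e) = 1$ and $Q_t$ with $\beta = 0$ is the expected path length.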

Theorem 3.1.

The expected regret of the Exp3-OE algorithm in the SOPPP satisfies:

$$R_T \le \frac{\ln(P)}{\eta} + \left[\beta + \frac{n \cdot \eta}{2}\right] \cdot \sum_{t \in [T]} Q_t. \tag{3}$$

A complete proof of Theorem 3.1 can be found in Appendix C; it follows an approach similar to [2, 10], with several necessary adjustments to handle the new biased loss estimator in Exp3-OE. To see the relationship between the structure of the learner’s side-observations and the bound on the expected regret, we look for upper bounds on $Q_t$ in terms of the observation graphs’ parameters. In terms of the independence number of the observation graph at each stage (computed while ignoring the direction of the edges), we have the following statement.

Theorem 3.2.

Let us define , and . Upper-bounds of in different cases of  are given in the following table:

A proof of this theorem is given in Appendix E. The main idea of this proof is based on several graph-theoretical lemmas extracted from [2, 19, 22]. These lemmas establish the relationship between the independence number of a graph and the ratios of the weights on the graph’s vertices that have forms similar to the key term $Q_t$. The case where the observation graphs are non-symmetric and do not satisfy the assumption of Section 2 is the most general setting. Moreover, as shown in Theorem 3.2, the bounds on $Q_t$ are improved if the observation graphs satisfy either the symmetry condition or that assumption. Intuitively, given the same independence numbers, a symmetric observation graph gives the learner more information than a non-symmetric one; thus, it yields a better bound on $Q_t$ and on the expected regret. On the other hand, the assumption of Section 2 is a technical assumption that allows the use of different techniques in the proofs to obtain better bounds. These cases have not been explicitly analyzed in the literature, although they are satisfied in several practical situations, including the CB and HS games (see Section 4).

Finally, we give results on the upper-bounds of the expected regret, obtained by the Exp3-OE algorithm, presented as a corollary of Theorems 3.1 and 3.2.

Corollary 3.3.

In SOPPP, let be an upper bound of . With appropriate choices of the parameters and , the expected regret of the Exp3-OE algorithm is:

• in the general cases.

• if assumption  is satisfied by the observation graphs .

A proof of Corollary 3.3 and the choices of the parameters  and (these choices are non-trivial) yielding these results will be given in Appendix F. We can extract from this proof several more explicit results as follows: in the general case, when the observations graphs are non-symmetric and if they are all symmetric; on the other hand, in cases that all the observation graphs satisfy , if the observations graphs are non-symmetric and if they are all symmetric.

We note that a trivial upper bound on the independence number of an observation graph is its number of vertices, which is $E$ (the number of edges in the DAG). In general, the more connected the observation graph is, the smaller this upper bound may be chosen, and thus the better the upper bound on the expected regret. In the (classical) semi-bandit setting, where the observation graphs contain only self-loops, the independence number equals $E$; in the full-information setting, where they are complete graphs, it equals 1. Finally, we also note that, under a condition that is typical in practice (including in the CB and HS games), the bound in Corollary 3.3 matches in order the bounds (ignoring the logarithmic factors) given by the FPL-IX algorithm (see [19]). On the other hand, the form of the regret bound provided for the Exp3-IX algorithm (see [19]) does not allow a direct comparison with the bound of Exp3-OE in the general SOPPP. Exp3-IX is only analyzed by [19] in a special case; in that case, we observe that the bound given by our Exp3-OE algorithm is better than that of Exp3-IX (by some multiplicative constants).

4 Colonel Blotto Games and Hide-and-Seek Games as SOPPP

Given the regret analysis of Exp3-OE in SOPPP, we now return to our main motivation, the Colonel Blotto and the Hide-and-Seek games, and discuss how to apply our findings to these games. To address this, we define formally the online version of the games and show how these problems can be formulated as SOPPP in Sections 4.1 and 4.2, then we demonstrate the benefit of using the Exp3-OE algorithm for learning in these games (Section 4.3).

4.1 Colonel Blotto Games as an SOPPP

The online Colonel Blotto game (the CB game). This is a game between a learner and an adversary over a fixed number of battlefields within a time horizon $T$. Each battlefield has a value at each stage that is unknown to the learner (knowledge of the battlefields’ values is not assumed, lest it limit the scope of application of our model; see, e.g., the radio resource allocation problem discussed in Section 1). At each stage, the learner needs to distribute a fixed number of troops among the battlefields while the adversary simultaneously allocates hers; that is, the learner chooses an allocation vector from her strategy set. At each stage and on each battlefield, if the adversary’s allocation is strictly larger than the learner’s allocation, the learner loses this battlefield and suffers the corresponding loss; if the allocations are tied, she suffers the loss associated with a tie; otherwise, she wins and suffers no loss. At the end of each stage, the learner observes the loss from each battlefield (and which battlefields she wins, ties, or loses) but not the adversary’s allocations. The learner’s loss at each stage is the sum of the losses from all the battlefields. The objective of the learner is to minimize her expected regret. Note that, as in SOPPP, we also consider non-oblivious adversaries in the CB game.

While this problem can be formulated as a standard OComb, it is difficult to derive an efficient learning algorithm under that formulation, due to the exponentially large set of strategies that the learner can choose from at each stage. Instead, we show that by reformulating the problem as an SOPPP, we can exploit the advantages of the Exp3-OE algorithm to solve it. To do so, first note that the learner can deduce several side-observations as follows: (i) if she allocates some number of troops to a battlefield and wins, she knows that she would also have won had she allocated more troops to it; (ii) if the allocations are tied at a battlefield, she knows the adversary’s allocation to this battlefield exactly and can deduce all the losses she would have suffered had she allocated differently to it; (iii) if she allocates some number of troops to a battlefield and loses, she knows that she would also have lost had she allocated fewer troops to it.
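The three deduction rules above can be written down directly for a single battlefield. The following is a minimal sketch with hypothetical function names; `budget` stands for the learner's total number of troops, and outcomes are encoded as the strings "win", "tie", and "lose".

```python
def revealed_allocations(allocation, outcome, budget):
    """Allocations to one battlefield whose loss the learner can deduce.

    `outcome` is "win", "tie" or "lose" for the allocation actually played.
    """
    if outcome == "win":
        # Any allocation at least as large would also have won.
        return set(range(allocation, budget + 1))
    if outcome == "tie":
        # The adversary's allocation equals `allocation`, so every alternative is decided.
        return set(range(0, budget + 1))
    if outcome == "lose":
        # Any allocation at most as large would also have lost.
        return set(range(0, allocation + 1))
    raise ValueError(f"unknown outcome: {outcome!r}")


def hypothetical_outcome(alternative, adversary_allocation):
    """Outcome the learner would have had with `alternative` troops (usable after a tie)."""
    if alternative > adversary_allocation:
        return "win"
    if alternative == adversary_allocation:
        return "tie"
    return "lose"
```

For example, with a budget of 4 troops, winning a battlefield with 2 troops reveals the outcome of allocating 2, 3, or 4 troops there, whereas a tie reveals the outcome of every possible allocation to that battlefield.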

Now, to cast the CB game as an SOPPP, for each instance of the game’s parameters (the number of troops and the number of battlefields), we create a DAG whose set of paths has a one-to-one correspondence with the learner’s strategy set. Due to the lack of space, we only present here an example illustrating the graph of an instance of the CB game in Figure 1-(a), and we give the formal definition of the graph in Appendix G. All paths of this graph have the same length, and each edge corresponds to allocating a certain number of troops to a battlefield. Therefore, the CB game model is equivalent to a PPP where at each stage the learner chooses a path in the graph and the loss on each edge is generated from the allocations of the adversary and the learner (corresponding to that edge) according to the rules of the game. At each stage, the (semi-bandit) feedback and the side-observations deduced by the learner as described above induce an observation graph. (E.g., in Figure 1-(a), if the learner chooses a path going through an edge that corresponds to allocating a certain number of troops to a battlefield and wins, so that the loss at that edge is 0, then she deduces that the losses on the edges corresponding to allocating at least that many troops to the same battlefield are all 0.) This formulation transforms any CB game into an SOPPP.
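One natural way to realize such a one-to-one correspondence is a layered graph whose vertex `(i, j)` records that `j` troops have been used on the first `i` battlefields. The sketch below is an assumption made for illustration and may differ in details from the formal definition in Appendix G; it also assumes that all troops must be allocated. Paths are counted by dynamic programming and compared with the number of ways to split the troops among the battlefields.

```python
from math import comb


def cb_graph(troops, battlefields):
    """Layered DAG: vertex (i, j) means j troops have been used on the first i battlefields."""
    edges = []
    for i in range(battlefields):
        # Only j = 0 is reachable at layer 0; the last layer must end at j = troops.
        starts = [0] if i == 0 else range(troops + 1)
        for j in starts:
            ends = [troops] if i == battlefields - 1 else range(j, troops + 1)
            for j2 in ends:
                # This edge allocates j2 - j troops to battlefield i + 1.
                edges.append(((i, j), (i + 1, j2)))
    return edges


def count_paths(edges, source, dest):
    """Number of source-to-dest paths, by memoized recursion over the DAG."""
    succ = {}
    for tail, head in edges:
        succ.setdefault(tail, []).append(head)
    memo = {dest: 1}

    def count(v):
        if v not in memo:
            memo[v] = sum(count(h) for h in succ.get(v, []))
        return memo[v]

    return count(source)


def path_to_allocation(path):
    """Recover the allocation vector encoded by a path."""
    return [head[1] - tail[1] for tail, head in path]


def num_strategies(troops, battlefields):
    """Number of ways to split `troops` indivisible troops among `battlefields`."""
    return comb(troops + battlefields - 1, battlefields - 1)
```

With 3 troops and 3 battlefields, this construction has 18 edges and 10 source-to-destination paths, matching the 10 possible allocations; the gap between the two counts widens quickly as the parameters grow, which is why working with edge weights rather than path weights matters.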

Note that since there are edges in that refer to the same allocation (e.g., the edges , and in all refer to allocating  troops to battlefield ), in the observation graphs, the vertices corresponding to these edges are always connected. Therefore, an upper bound of the independence number of in the CB game is . Moreover, we can verify that the observation graph of the CB game satisfies assumption  for any and it is non-symmetric.

4.2 Hide-and-Seek Games as an SOPPP

The online Hide-and-Seek game (the HS game). This is a repeated game (within the time horizon $T$) between a hider and a seeker. In this work, the learner plays the role of the seeker and the hider is the adversary. There is a fixed set of indexed locations. At each stage, the learner sequentially chooses a fixed number of locations, called a search, to seek the hider; the position of a location in this sequence is called its move. The hider maliciously assigns losses to all locations (intuitively, these losses can be the time wasted searching a wrong location or the probability that the hider does not hide there, etc.). In the HS game, the adversary is non-oblivious; moreover, in this work, we consider the following condition on how the hider/adversary assigns the losses to the locations:

• At each stage, the adversary secretly assigns a loss to each location (unknown to the learner). These losses are fixed throughout the learner’s search at that stage.

The learner’s loss at each stage is the sum of the losses from the locations chosen in her search at that stage. Moreover, in practice the learner’s search often needs to satisfy some constraints. In this work, as an example, we use the following constraint, called the coherence constraint: the seeker cannot search too far away from her previously chosen location, where the maximum allowed distance is fixed. (Our results can be applied to HS games with other constraints, e.g., that she can only search forward, or that she cannot search a location more than a given number of times.) At the end of each stage, the learner only observes the losses from the locations she chose in her search, and her objective is to minimize her total loss over the time horizon.

Similar to the case of the CB game, tackling the HS game as a standard OComb is computationally involved. As such, we follow the SOPPP formulation instead. To do this, we create a DAG whose paths set has a one-to-one correspondence to the set containing all feasible -search of the learner in the HS game with locations under -coherent constraint. Figure 1-(b) illustrates the corresponding graph of an instance of the HS game and we give a formal definition of in Appendix G. The HS game is equivalent to the PPP where the learner chooses a path in and edges’ losses are generated by the adversary at each stage (note that to ensure all paths end at , there are auxiliary edges in that are always embedded with losses). Note that there are edges and paths in . Moreover, knowing that the adversary follows condition , the learner can deduce the following side-observations: within a stage, the loss at each location remains the same no matter when it is chosen among the -search, i.e., knowing the loss of choosing location as her -th move, the learner knows all the loss if she chooses location as her -th move for any . The semi-bandit feedback and side-observations as described above generate the observation graphs (e.g., in Figure 1-(b), the edges , and represent that location is chosen; thus, they mutually reveal each other). The independence number of is for any . The observation graphs of the HS game are symmetric and do not satisfy . Finally, we consider a relaxation of condition :

• At stage , the adversary assigns a loss on each location . For , after the learner chooses, say location , as her -th move, the adversary can observe that and change the losses for any location that has not been searched before by the learner, i.e., she can change the losses . [Footnote 9: An interpretation is that by searching a location, the learner/seeker “discovers and secures” that location; therefore, the adversary/hider cannot change her assigned loss at that place.]

By replacing condition with condition , we limit the side-observations of the learner: she can only deduce that if , the edges in representing choosing a location as the move reveal the edges representing choosing that same location as the -th move, but not vice versa. In this case, the observation graph is non-symmetric; however, its independence number is still as in the HS games with condition .
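To make the symmetric case concrete (the HS game under the original condition, where edges representing the same location mutually reveal each other), the following is a minimal sketch. It is not the formal construction of Appendix G: the layered vertex encoding `(j, a)`, the function names, and the specific form of the coherence constraint (`|a - b| <= kappa` between consecutive moves) are assumptions made for illustration, and the auxiliary edges leading to the destination vertex are omitted.

```python
from collections import defaultdict


def hs_edges(n, k, kappa):
    """Edges of a hypothetical layered DAG for an HS game.

    Vertex (j, a) means "location a is searched as the j-th move".
    ASSUMPTION: consecutive moves a -> b satisfy |a - b| <= kappa.
    Auxiliary edges to the destination vertex are left out.
    """
    # First move: any location can be reached from the source "s".
    edges = [("s", (1, b)) for b in range(1, n + 1)]
    for j in range(2, k + 1):
        for a in range(1, n + 1):
            for b in range(max(1, a - kappa), min(n, a + kappa) + 1):
                edges.append(((j - 1, a), (j, b)))
    return edges


def reveal_classes(edges):
    """Group edges by the location they select.

    Under the symmetric side-observations, all edges in one class
    share the same loss within a stage, so they reveal one another.
    """
    classes = defaultdict(list)
    for u, v in edges:
        location = v[1]
        classes[location].append((u, v))
    return classes


edges = hs_edges(n=4, k=3, kappa=1)
classes = reveal_classes(edges)
for location, members in sorted(classes.items()):
    print(location, len(members))
```

In this sketch every edge falls into exactly one class and the classes are cliques of the reveal relation, so a set of edges that do not reveal one another can contain at most one edge per location.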

4.3 Performance of Exp3-OE in the Colonel Blotto and Hide-and-Seek Games

Having formulated the CB game and the HS game as SOPPPs, we can use the Exp3-OE algorithm in these games. From Section 3.1 and the specific graphs of the CB and HS game, we can deduce that Exp3-OE runs in at most time. We remark again that Exp3-OE’s running time is linear in and efficient in all cases unlike when we run FPL-IX in the CB and HS games. Moreover, we can deduce the following result directly from Corollary 3.3:

Corollary 4.1.

The expected regret of the Exp3-OE algorithm satisfies:

• in the CB games with troops and battlefields.

• in the HS games with locations and -search.

At a high level, given the same scale on their inputs, the independence numbers of the observation graphs in HS games are smaller than in CB games (by a multiplicative factor of ). However, since assumption  is satisfied by the observation graphs of the CB games and not by the HS games, the expected regret bounds of the Exp3-OE algorithm in these games have the same order of magnitude. From Corollary 4.1, we note that in the CB games, the order of the regret bounds given by Exp3-OE is better than that of the FPL-IX algorithm (thanks to the fact that is satisfied). [Footnote 10: More explicitly, in the CB game, FPL-IX has a regret at most (C is a constant indicated by [19]) and Exp3-OE’s regret bound is (if , we can rewrite this bound as ).] On the other hand, in the HS games with , the regret bound of the Exp3-OE algorithm improves on the bound of FPL-IX, but the two are still of the same order in the games’ parameters (ignoring the logarithmic factors). [Footnote 11: More explicitly, in HS games with , FPL-IX’s regret is and Exp3-OE’s regret is (similar results can be obtained for the HS games with ).] Note that the regret bound of Exp3-OE in the HS game with Condition  (involving symmetric observation graphs) is slightly better than that in the HS game with Condition .

We also conducted several numerical experiments that compare the running time and the actual expected regret of Exp3-OE and FPL-IX in CB and HS games. The numerical results are consistent with the theoretical results in this work. Our code for these experiments can be found at https://github.com/dongquan11/CB-HS.SOPPP.

Finally, we compare the regret guarantees given by our Exp3-OE algorithm and by the OSMD algorithm (see [3]), the benchmark algorithm for OComb with semi-bandit feedback (although OSMD does not run efficiently in general): Exp3-OE is better than OSMD in CB games if ; in HS games if and in the HS games with condition if . We give a proof of this statement in Appendix H. Intuitively, the regret guarantees of Exp3-OE are better than those of OSMD in the CB games where the learner’s budget is sufficiently larger than the number of battlefields and in the HS games where the total number of locations is sufficiently larger than the number of moves that the learner can make in each stage.

5 Conclusion

In this work, we introduce the Exp3-OE algorithm for the path planning problem with semi-bandit feedback and side-observations. Exp3-OE is always efficiently implementable. Moreover, its regret guarantees match those of the FPL-IX algorithm (and are better in some cases). We apply our findings to derive the first solutions to the online version of the Colonel Blotto and Hide-and-Seek games. This work also extends the scope of application of the PPP model in practice, even for large instances.

Acknowledgment:

This work was supported by ANR through the “Investissements d’avenir” program (ANR-15-IDEX-02) and grant ANR-16-TERC0012; and by the Alexander von Humboldt Foundation.

References

• [1] N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren (2015) Online learning with feedback graphs: beyond bandits. In JMLR Workshop and Conference Proceedings, Vol. 40. Cited by: footnote 1.
• [2] N. Alon, N. Cesa-Bianchi, C. Gentile, and Y. Mansour (2013) From bandits to experts: a tale of domination and independence. In Advances in Neural Information Processing Systems, pp. 1610–1618. Cited by: Appendix D, Appendix E, §3.2, §3.2, footnote 1, footnote 2.
• [3] J. Audibert, S. Bubeck, and G. Lugosi (2014) Regret in online combinatorial optimization. Mathematics of Operations Research 39 (1), pp. 31–45. Cited by: §4.3.
• [4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §1.
• [5] S. Behnezhad, S. Dehghani, M. Derakhshan, M. T. H. Aghayi, and S. Seddighin (2017) Faster and simpler algorithm for optimal strategies of blotto game. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI-17), pp. 369–375. Cited by: §G.1, §1.
• [6] S. Bhattacharya, T. Başar, and M. Falcone (2014) Surveillance for security as a pursuit-evasion game. In International Conference on Decision and Game Theory for Security, pp. 370–379. Cited by: §1.
• [7] J. Blocki, N. Christin, A. Datta, A. D. Procaccia, and A. Sinha (2013) Audit games. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI-13), pp. 41–47. Cited by: §1.
• [8] E. Borel (1921) La théorie du jeu et les équations intégrales à noyau symétrique. Comptes rendus de l’Académie des Sciences 173 (1304-1308), pp. 58. Cited by: §1.
• [9] J. L. Bower and C. G. Gilbert (2005) From resource allocation to strategy. Oxford University Press. Cited by: §1.
• [10] N. Cesa-Bianchi and G. Lugosi (2012) Combinatorial bandits. Journal of Computer and System Sciences 78 (5), pp. 1404–1422. Cited by: §1, §3.2, footnote 2.
• [11] W. Chen, Y. Wang, and Y. Yuan (2013) Combinatorial multi-armed bandit: general framework and applications. In Proceedings of the 30th International Conference on Machine Learning (ICML-1, pp. 151–159. Cited by: §1.
• [12] P. H. Chia (2012) Colonel Blotto in web security. In The 11th Workshop on Economics and Information Security, WEIS Rump Session, pp. 141–150. Cited by: §1.
• [13] S. F. Chien, C. C. Zarakovitis, Q. Ni, and P. Xiao (2019) Stochastic asymmetric blotto game approach for wireless resource allocation strategies. IEEE Transactions on Wireless Communications. Cited by: §1.
• [14] Y. Freund and R. E. Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55 (1), pp. 119–139. Cited by: §1.
• [15] O. Gross and R. Wagner (1950) A continuous colonel blotto game. Technical report RAND project air force Santa Monica CA. Cited by: §1.
• [16] A. György, T. Linder, G. Lugosi, and G. Ottucsák (2007) The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research 8 (Oct), pp. 2369–2403. Cited by: Appendix A, §3.
• [17] J. P. Hespanha, M. Prandini, and S. Sastry (2000) Probabilistic pursuit-evasion games: a one-step nash approach. In Proceedings of the 39th IEEE Conference on Decision and Control, pp. 2272–2277. Cited by: §1.
• [18] A. Kalai and S. Vempala (2005) Efficient algorithms for online decision problems. Journal of Computer and System Sciences 71 (3), pp. 291–307. Cited by: §1.
• [19] T. Kocák, G. Neu, M. Valko, and R. Munos (2014) Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pp. 613–621. Cited by: Appendix D, §1, §2, §3.2, §3.2, footnote 1, footnote 10, footnote 3.
• [20] D. Korzhyk, V. Conitzer, and R. Parr (2010) Complexity of computing optimal stackelberg strategies in security resource allocation games. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI-10), pp. 805–810. Cited by: §1.
• [21] D. Kovenock and B. Roberson (2012) Coalitional Colonel Blotto games with application to the economics of alliances. Journal of Public Economic Theory 14 (4), pp. 653–676. External Links: Document Cited by: §1.
• [22] S. Mannor and O. Shamir (2011) From bandits to experts: on the value of side-observations. In Advances in Neural Information Processing Systems, pp. 684–692. Cited by: Appendix D, §3.2, footnote 1.
• [23] A. M. Masucci and A. Silva (2014) Strategic resource allocation for competitive influence in social networks. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 951–958. Cited by: §1.
• [24] A. M. Masucci and A. Silva (2015) Defensive resource allocation in social networks. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC), pp. 2927–2932. Cited by: §1.
• [25] B. Roberson (2006) The Colonel Blotto game. Economic Theory 29 (1), pp. 2–24. External Links: Document Cited by: §1, §1.
• [26] S. Sakaue, M. Ishihata, and S. Minato (2018) Efficient bandit combinatorial optimization algorithm with zero-suppressed binary decision diagrams. In International Conference on Artificial Intelligence and Statistics, pp. 585–594. Cited by: Appendix A.
• [27] G. Schwartz, P. Loiseau, and S. S. Sastry (2014) The heterogeneous Colonel Blotto game. In Proceedings of the 7th International Conference on Network Games, Control and Optimization (NetGCoop), pp. 232–238. Cited by: §1, §1.
• [28] E. Takimoto and M. K. Warmuth (2003) Path kernels and multiplicative updates. Journal of Machine Learning Research 4 (Oct), pp. 773–818. Cited by: Appendix A, §3.
• [29] R. Vidal, O. Shakernia, H. J. Kim, D. H. Shim, and S. Sastry (2002) Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation. IEEE transactions on robotics and automation 18 (5), pp. 662–669. Cited by: §1.
• [30] J. Von Neumann (1953) A certain zero-sum two-person game equivalent to the optimal assignment problem. Contributions to the Theory of Games 2, pp. 5–12. Cited by: §1.
• [31] D. Q. Vu, P. Loiseau, and A. Silva (2018) Efficient computation of approximate equilibria in discrete colonel blotto games. In Proceedings of the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI), pp. 519–526. Cited by: §1.
• [32] Q. Wang and M. Liu (2016) Learning in hide-and-seek. IEEE/ACM Transactions on Networking 24 (2), pp. 1279–1292. Cited by: §1.
• [33] Y. Yavin (1987) Pursuit–evasion differential games with deception or interrupted observation. In Pursuit-Evasion Differential Games, pp. 191–203. Cited by: §1.
• [34] T. Yucek and H. Arslan (2009) A survey of spectrum sensing algorithms for cognitive radio applications. IEEE communications surveys & tutorials 11 (1), pp. 116–130. Cited by: §1.
• [35] C. Zhang, V. Lesser, and P. Shenoy (2009) A multi-agent learning approach to online distributed resource allocation. In Proceedings of the 21st International Joint Conference on Artificial Intelligence. Cited by: §1.

Appendix A Weight Pushing for Path Sampling

We re-visit some useful results in the literature. In this section, we consider a DAG with parameters as introduced in Section 2. For simplicity, we assume that each edge in belongs to at least one path in . Let us respectively denote by and the set of the direct successors and the set of the direct predecessors of any vertex . Moreover, let and respectively denote the edge and the set of all paths from vertex to vertex .

Let us consider a weight for each edge . The Exp3-OE algorithm needs to sample a path with the probability:

$$x(\tilde{p}) := \frac{\prod_{e \in \tilde{p}} w(e)}{\sum_{p \in \mathcal{P}} \prod_{e \in p} w(e)}. \tag{4}$$

Directly computing and sampling from takes time, which is very inefficient. To sample the path efficiently, we first label the vertex set by such that if there exists an edge connecting to then . We then define the following terms for each vertex :

$$H(s,u) := \sum_{p \in \mathcal{P}_{s,u}} \prod_{e \in p} w(e) \quad \text{and} \quad H(u,d) := \sum_{p \in \mathcal{P}_{u,d}} \prod_{e \in p} w(e).$$

Intuitively, is the aggregate weight of all paths from vertex to vertex  and is exactly the denominator in (4). These terms and can be recursively computed by the WP algorithm (i.e., Algorithm 1) that runs in time, through dynamic programming. This is called weight pushing and it is used by [16, 26, 28].

Based on the WP algorithm (i.e., Algorithm 1), we construct the WPS algorithm (i.e., Algorithm 2) that uses the weights as inputs and randomly outputs a path in . Intuitively, starting from the source vertex , Algorithm 2 samples the vertices of the path one at a time, based on the terms computed by Algorithm 1. It is noteworthy that Algorithm 2 also runs in time, and it is straightforward to prove that the probability that a path is sampled by Algorithm 2 matches exactly .
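The weight-pushing idea of this appendix can be sketched as follows. This is an illustrative implementation under stated assumptions, not a transcription of Algorithms 1 and 2: the function names, the dictionary-based graph encoding, and the requirement that `order` lists the vertices in topological order (source first, destination last) are choices made here. It also assumes, as in the text, that every edge lies on at least one source-to-destination path and that all weights are positive.

```python
import random
from collections import defaultdict


def push_weights(order, w, s, d):
    """Dynamic programming for H(s, u) and H(u, d).

    order: vertices in topological order; w: dict {(u, v): weight}.
    Returns (H_s, H_d) with H_s[u] = H(s, u) and H_d[u] = H(u, d).
    """
    pred, succ = defaultdict(list), defaultdict(list)
    for (u, v) in w:
        succ[u].append(v)
        pred[v].append(u)
    H_s = {v: 0.0 for v in order}
    H_d = {v: 0.0 for v in order}
    H_s[s], H_d[d] = 1.0, 1.0  # the empty path has weight 1
    for v in order:  # forward pass: aggregate weights from s
        if v != s:
            H_s[v] = sum(H_s[u] * w[(u, v)] for u in pred[v])
    for u in reversed(order):  # backward pass: aggregate weights to d
        if u != d:
            H_d[u] = sum(w[(u, v)] * H_d[v] for v in succ[u])
    return H_s, H_d


def sample_path(order, w, s, d, rng=random):
    """Sample a path with probability proportional to its weight product."""
    _, H_d = push_weights(order, w, s, d)
    succ = defaultdict(list)
    for (u, v) in w:
        succ[u].append(v)
    path, u = [], s
    while u != d:
        nxt = succ[u]
        # Moving to v has probability w(u, v) * H(v, d) / H(u, d).
        scores = [w[(u, v)] * H_d[v] for v in nxt]
        v = rng.choices(nxt, weights=scores, k=1)[0]
        path.append((u, v))
        u = v
    return path


# Toy DAG with three s-d paths of weights 3, 8 and 20.
order = ["s", "a", "b", "d"]
w = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "d"): 3.0,
     ("b", "d"): 4.0, ("a", "b"): 5.0}
H_s, H_d = push_weights(order, w, "s", "d")
print(H_d["s"], sample_path(order, w, "s", "d"))
```

The product of the step probabilities along a path telescopes to the path's weight divided by H(s, d), which is why the sampled distribution agrees with (4) without enumerating the paths.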

Appendix B Proof of Algorithm 3.1’s Output

Proof.

Fixing an edge , we prove that when Algorithm 3.1 takes the edge weights as input, it outputs exactly . We note that if , then .

We denote and label the edges in the set by . The for-loop in lines - of Algorithm 3.1 runs consecutively over the edges in as follows:

• After the for-loop runs for , we have ; therefore, since is computed from the original weights . Due to line  that sets , henceforth in Algorithm 3.1, the weight of any path that contains is set to .

• After the for-loop runs for , we have because any path has the weight . Therefore, .

• Similarly, after the for-loop runs for (where ), we have:

$$q_t(e) = \sum_{k=1}^{i} \Biggl( \sum_{p \in \{p \ni e_k\} \setminus \bigcup_{j<k} \{p \ni e_j\}} x_t(p) \Biggr).$$

• Therefore, after the for-loop finishes running for every edge in , we have where each term is counted only once even if contains more than one edge that reveals the edge .
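The argument above can be mirrored by a short sketch, written under assumptions rather than copied from Algorithm 3.1: the names `push_weights` and `reveal_probability`, the graph encoding, and the argument `revealing` (the list of edges that reveal the fixed edge) are hypothetical. Each pass adds the normalized weight of the paths through the current revealing edge and then sets that edge's weight to zero, so that a path containing several revealing edges is counted only once.

```python
from collections import defaultdict


def push_weights(order, w, s, d):
    """Return (H_s, H_d): aggregate path weights from s and to d."""
    pred, succ = defaultdict(list), defaultdict(list)
    for (u, v) in w:
        succ[u].append(v)
        pred[v].append(u)
    H_s = {v: 0.0 for v in order}
    H_d = {v: 0.0 for v in order}
    H_s[s], H_d[d] = 1.0, 1.0
    for v in order:
        if v != s:
            H_s[v] = sum(H_s[u] * w[(u, v)] for u in pred[v])
    for u in reversed(order):
        if u != d:
            H_d[u] = sum(w[(u, v)] * H_d[v] for v in succ[u])
    return H_s, H_d


def reveal_probability(order, w, revealing, s, d):
    """Probability that a sampled path contains at least one revealing edge."""
    w = dict(w)  # work on a copy; the caller's weights stay untouched
    _, H_d0 = push_weights(order, w, s, d)
    total = H_d0[s]  # H(s, d) under the original weights
    q = 0.0
    for (u, v) in revealing:
        H_s, H_d = push_weights(order, w, s, d)
        # Weight of all remaining paths that pass through edge (u, v).
        q += H_s[u] * w[(u, v)] * H_d[v] / total
        # Zero the edge so these paths are not counted again later.
        w[(u, v)] = 0.0
    return q


order = ["s", "a", "b", "d"]
w = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "d"): 3.0,
     ("b", "d"): 4.0, ("a", "b"): 5.0}
print(reveal_probability(order, w, [("a", "b"), ("a", "d")], "s", "d"))
```

In the toy graph the paths have weights 3, 8 and 20 (total 31); the two revealing edges in the usage line cover the paths of weight 20 and 3, so the result should be close to 23/31.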

Appendix C Proof of Theorem 3.1

See 3.1

Proof.

We first denote . [Footnote 12: We recall that .] From line of Algorithm 3, we trivially have:

$$w_{t+1}(p) = w_t(p) \cdot \exp\bigl(-\eta \hat{L}_t(p)\bigr), \quad \forall p \in \mathcal{P},\ \forall t \in [T-1]. \tag{5}$$

We recall that and the notation denoting the expectation w.r.t. the randomness in choosing in Algorithm 3 (i.e., w.r.t. the information up to time ). From (2), we have:

$$\mathbb{E}_t\bigl[\hat{L}_t(p)\bigr] \le L_t(p) := \sum_{e \in p} \ell_t(e), \quad \forall p \in \mathcal{P}. \tag{6}$$

Under the condition that , we obtain:

 Wt+1Wt =∑p∈Pwt+1(p)Wt =∑p∈Pwt(p)⋅exp(−η^Lt(p))Wt =∑p∈Pxt(p)⋅exp(−η^Lt(p))) ≤∑p∈P[xt(p)(1