# Colonel Blotto Games and Hide-and-Seek Games as Path Planning Problems with Side Observations

Resource allocation games such as the famous Colonel Blotto (CB) and Hide-and-Seek (HS) games are often used to model a large variety of practical problems, but only in their one-shot versions. Indeed, due to their extremely large strategy space, it remains an open question how one can efficiently learn in these games. In this work, we show that the online CB and HS games can be cast as path planning problems with side-observations (SOPPP): at each stage, a learner chooses a path on a directed acyclic graph and suffers the sum of losses that are adversarially assigned to the corresponding edges; she then receives semi-bandit feedback with side-observations (i.e., she observes the losses on the chosen edges plus some others). We then propose a novel algorithm, Exp3-OE, the first of its kind with guaranteed efficient running time for SOPPP without requiring any auxiliary oracle. We provide an expected-regret bound for Exp3-OE in SOPPP that matches the order of the best benchmark in the literature. Moreover, we introduce additional assumptions on the observability model under which we can further improve the regret bounds of Exp3-OE. We illustrate the benefit of using Exp3-OE in SOPPP by applying it to the online CB and HS games.


## 1 Introduction

Resource allocation games have been studied extensively in the literature and have been shown to be very useful for modeling many practical situations, including online decision problems; see e.g. blocki2013audit ; bower2005resource ; korzhyk2010complexity ; zhang2009multi . Two of the most renowned are the Colonel Blotto game (henceforth, CB game) and the Hide-and-Seek game (henceforth, HS game). In the (one-shot) CB game, two players, each with a fixed budget, simultaneously allocate their (indivisible) resources on $n$ battlefields; each player's payoff is the aggregate of the values of the battlefields where she has a higher allocation. The scope of applications of the CB games includes a variety of problems; for instance, in security (e.g., chia2012 ; schwartz2014 ) where resources correspond to security forces, in politics (e.g., kovenock2012 ; roberson2006 ) for allocating budget to attract voters, and in advertisement (e.g., masucci2014 ; masucci2015 ) for distributing the broadcasting time. On the other hand, in the (one-shot) HS game, a seeker chooses $n$ among $k$ locations ($n \le k$) to search for a hider, who randomly chooses to hide in one of the locations. The seeker's payoff is the probability that she finds the hider, and the hider's payoff is the probability that she successfully escapes the seeker's pursuit. Several variants of the HS games are used to model surveillance situations bhattacharya2014surveillance ; bhattacharya2009existence , anti-jamming problems in telecommunications navda2007using ; wang16 ; xu2005feasibility , vehicle control chung2011search ; vidal2002probabilistic , etc.

Both the CB games and the HS games have a long-standing history (originating in 1921 borel1921 and 1953 vonneumann53 , respectively); however, the results achieved so far in these games are mostly limited to their one-shot and full-information versions (see e.g., Behnezhad17a ; grosswagner ; roberson2006 ; schwartz2014 ; Vu18a for CB games and grote1975theory ; hespanha2000probabilistic ; yavin1987pursuit for HS games). In contrast, in most applications (e.g., web security, advertising, telecommunications), a more natural setting is one where the game is played repeatedly and players have access only to incomplete information at each stage. In this setting, players are often required to sequentially learn the game on the fly and adjust the trade-off between exploiting known information and exploring to gain new information. Thus, this work focuses on the following sequential learning problem: at each stage, a learner plays a CB game (resp. HS game); at the end of the stage, she receives limited feedback, namely the gain she obtains from each battlefield (resp. the hider's escape probability corresponding to the chosen locations); and her objective is to maximize her cumulative payoff. A formal definition of these problems is given in Section 4; hereinafter, we reuse the terms CB game and HS game to refer to these sequential learning versions of the games. The main challenge in these games is that their strategy space is exponential in the natural parameters (e.g., number of troops and battlefields in the CB game, number of locations in the HS game); hence, how to learn efficiently in these games is an open question.

Our first contribution towards solving this open question is to show that the CB and HS games can be cast as Path Planning Problems (henceforth, PPP), one of the most well-studied instances of the Online Combinatorial Optimization framework (henceforth, OComb; see chen2013combinatorial for a survey). In PPPs, given a graph with $E$ edges, at each stage, a learner chooses a path; then a loss in $[0,1]$ is adversarially chosen for each edge and the learner suffers the aggregate of the losses of the edges belonging to her chosen path. The learner's goal is to minimize regret, i.e., the difference between the learner's cumulative loss and that of the best action in hindsight. The information that the learner receives in the CB and HS games as described above corresponds directly to the so-called semi-bandit feedback setting of PPPs, i.e., at the end of each stage, the learner observes the losses of the edges belonging to her chosen path. However, the specific structure of the considered games also allows the learner to deduce (at no extra cost) from the semi-bandit feedback the losses of some other edges that may not belong to the chosen path; these are called side-observations. Henceforth, we will use the term SOPPP to refer to this PPP under semi-bandit feedback with side-observations.

SOPPP is a special case of OComb with side-observations (henceforth, SOComb) studied by kocak14 and, following their approach, we will use observation graphs (defined in Section 2) to capture the learner's observability. (The observation graphs, proposed in kocak14 and used here for SOPPP, extend the side-observations model for multi-armed bandit problems studied by alon15 ; alon13 ; mannor11 : they capture side-observations between edges, whereas the side-observations model considered in alon15 ; alon13 ; mannor11 is between actions, i.e., paths in PPPs.) In kocak14 , the authors focus on the class of Follow-the-Perturbed-Leader (FPL) algorithms (originating from kalai2005efficient ) and propose an algorithm named FPL-IX for SOComb, which could be applied directly to SOPPP. However, this faces two main problems: (i) the efficiency of FPL-IX is only guaranteed with high probability (as it depends on the geometric sampling technique) and (ii) it requires an efficient oracle that solves an optimization problem at each stage. Both are incompatible with our goal of learning in the CB and HS games.

In this paper, we focus instead on another prominent class of OComb algorithms, called Exp3 auer02b ; Freund1997 . Our second contribution is to propose an algorithm for SOPPP that solves both of the aforementioned issues and provides good regret guarantees. In more detail, this contribution is three-fold: (i) we propose a novel algorithm, Exp3-OE, that is applicable to any instance of SOPPP. Importantly, Exp3-OE is always guaranteed to run efficiently (i.e., in polynomial time in terms of the number of edges of the graph in SOPPP) without the need for any auxiliary oracle; (ii) we prove that Exp3-OE guarantees an upper-bound on the expected regret that matches in order the best benchmark in the literature (the FPL-IX algorithm). We also prove further improvements under additional assumptions on the observation graphs that have so far been ignored in the literature; (iii) we demonstrate the benefit of using the Exp3-OE algorithm in the CB and HS games.

Our Exp3-OE algorithm is based on the Exp3-IX algorithm kocak14 . However, Exp3-IX has a very inefficient running time in SOComb (and particularly in SOPPP) and thus, it is only analyzed by kocak14 in the trivial cases of SOComb involving only actions whose L1-norm equals 1 (corresponding to SOPPP with graphs where all paths have length 1); the existence of an efficient implementation of Exp3-type algorithms in SOComb is left as an open question in kocak14 . We address this question in the particular case of SOPPP as follows. We introduce two major updates in Exp3-OE. First, unlike Exp3-IX, which uses an adaptive implicit exploration scheme, we assume that the time horizon $T$ is known in advance (if $T$ is unknown, we can use the doubling trick, see auer1995gambling ; besson2018doubling , to get similar results) and fix an implicit exploration parameter $\beta$ in the loss estimator of Exp3-OE. This change reduces the computations and leads to a different parameter tuning scheme with improved regret bounds compared to Exp3-IX. Second (and this is the main reason that makes Exp3-OE significantly more efficient than Exp3-IX), we use a novel loss estimator that can be computed efficiently with a dynamic-programming technique called weight pushing. Note that while weight pushing has been used for efficiently sampling paths from exponentially-updated weights in several variants of Exp3 (e.g., gyorgy2007 ; sakaue2018 ; takimoto2003 ), the way we apply it to compute the loss estimator is novel and non-trivial. Finally, note that the SOPPP model (and thus, our proposed Exp3-OE algorithm) can be applied to many problems beyond the considered games, e.g., auctions and recommendation systems.

Throughout the paper, we use bold symbols to denote vectors, e.g., $\mathbf{z} \in \mathbb{R}^n$, and $z(i)$ to denote the $i$-th element. For any $m \ge 1$, the set $\{1, 2, \ldots, m\}$ is denoted by $[m]$ and the indicator function of a set $A$ is denoted by $\mathbb{I}_A$. For graphs, we write either $e \in \mathbf{p}$ or $\mathbf{p} \ni e$ to indicate that an edge $e$ belongs to a path $\mathbf{p}$. For the sake of conciseness, we first present our second contribution on the SOPPP in general and then return in Section 4 to our first contribution relating to the CB and HS games.

## 2 Path Planning Problems with Side-Observations (SOPPP) Formulation

As discussed in Section 1, motivated by the CB and HS games, we focus on the path planning problem with semi-bandit and side-observations feedback (SOPPP) and design an Exp3-type algorithm that always runs efficiently in SOPPP. To do this, we first formally define the SOPPP model as follows.

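SOPPP model. Consider a directed acyclic graph (henceforth, DAG), denoted by $G$, whose set of vertices and set of edges are respectively denoted by $\mathcal{V}$ and $\mathcal{E}$. Let $V := |\mathcal{V}|$ and $E := |\mathcal{E}|$; there are two special vertices, a source and a destination, that are respectively called $s$ and $d$. We denote by $\mathcal{P}$ the set of all paths starting from $s$ and ending at $d$. Each path $\mathbf{p} \in \mathcal{P}$ corresponds to a vector in $\{0,1\}^E$ (thus, $\mathcal{P} \subset \{0,1\}^E$) where $p(e) = 1$ if and only if edge $e \in \mathcal{E}$ belongs to $\mathbf{p}$. Let $n$ be the length of the longest path in $\mathcal{P}$, that is, $n := \max_{\mathbf{p} \in \mathcal{P}} \|\mathbf{p}\|_1$. Given a time horizon $T$, at each (discrete) stage $t \in [T]$, a learner chooses a path $\tilde{\mathbf{p}}_t \in \mathcal{P}$. Then, a loss vector $\boldsymbol{\ell}_t \in [0,1]^E$ is secretly and adversarially chosen (obliviously to the learner's decisions). Each element $\ell_t(e)$ corresponds to the scalar loss embedded on the edge $e$. The learner's incurred loss is $L_t(\tilde{\mathbf{p}}_t) := \sum_{e \in \tilde{\mathbf{p}}_t} \ell_t(e)$, i.e., the sum of the losses from all the edges belonging to $\tilde{\mathbf{p}}_t$. The learner's feedback at stage $t$ after choosing $\tilde{\mathbf{p}}_t$ is as follows. First, she receives semi-bandit feedback, that is, she observes the losses $\ell_t(e)$ of all edges $e$ belonging to the chosen path $\tilde{\mathbf{p}}_t$. Additionally, each edge may reveal the losses on several other edges. To represent these side-observations at time $t$, we consider a graph, denoted $G_t^O$, containing $E$ vertices. Each vertex $v_e$ of $G_t^O$ corresponds to an edge $e$ of the graph $G$. There exists a directed edge from a vertex $v_e$ to a vertex $v_{e'}$ in $G_t^O$ if, by observing the edge loss $\ell_t(e)$, the learner can also deduce the edge loss $\ell_t(e')$ (we also denote this by $e \to e'$ and say that the edge $e$ reveals the edge $e'$). The objective of the learner is to minimize the cumulative expected regret, defined as $R_T := \mathbb{E}\big[\sum_{t \in [T]} L_t(\tilde{\mathbf{p}}_t)\big] - \min_{\mathbf{p}^* \in \mathcal{P}} \sum_{t \in [T]} L_t(\mathbf{p}^*)$.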

Hereinafter, in places where there is no ambiguity, we use the term path to refer to a path in $\mathcal{P}$ and the term observation graphs to refer to $G_t^O, t \in [T]$. In general, these observation graphs can depend on the decisions of both the learner and the adversary. On the other hand, all vertices in $G_t^O$ always have self-loops. In the case where none of the graphs $G_t^O, t \in [T]$, contains any edge other than these self-loops, no side-observation is allowed and the problem reduces to the classical semi-bandit setting. If all $G_t^O$ are complete graphs, SOPPP corresponds to the full-information PPPs. In this work, we focus on the uninformed setting, i.e., the learner observes $G_t^O$ only after making a decision at time $t$. We also introduce two new notations:

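$$O_t(e) := \{\mathbf{p} \in \mathcal{P} : \exists e' \in \mathbf{p},\ e' \to e\},\ \forall e \in \mathcal{E} \quad \text{and} \quad O_t(\mathbf{p}) := \{e \in \mathcal{E} : \exists e' \in \mathbf{p},\ e' \to e\},\ \forall \mathbf{p} \in \mathcal{P}.$$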

Intuitively, $O_t(e)$ is the set of all paths that, if chosen, reveal the loss on the edge $e$, and $O_t(\mathbf{p})$ is the set of all edges whose losses are revealed if the path $\mathbf{p}$ is chosen. Trivially, $\mathbf{p} \in O_t(e) \Leftrightarrow e \in O_t(\mathbf{p})$. Moreover, due to the semi-bandit feedback, if $e \in \mathbf{p}$, then $\mathbf{p} \in O_t(e)$ and $e \in O_t(\mathbf{p})$. Apart from the results for general observation graphs, in this work, we additionally present several results under two particular assumptions, satisfied by some instances in practice (e.g., the CB and HS games), that provide more refined regret bounds compared to the cases considered in kocak14 : (i) symmetric observation graphs where for each edge from $v_e$ to $v_{e'}$, there also exists an edge from $v_{e'}$ to $v_e$ (i.e., if $e \to e'$ then $e' \to e$), so that $G_t^O$ is an undirected graph; (ii) observation graphs under the following assumption (A0), which requires that if two edges belong to a path in $G$, then they cannot simultaneously reveal the loss of another edge.

• (A0): For any $e \in \mathcal{E}$, if $e' \to e$ and $e'' \to e$ (with $e' \neq e''$), then there is no path $\mathbf{p} \in \mathcal{P}$ such that $\mathbf{p} \ni e'$ and $\mathbf{p} \ni e''$.
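The observation sets and the two structural conditions above can be checked mechanically on small instances. The following is a minimal sketch, assuming a toy instance in which paths are tuples of edge indices and an observation graph is a set of `(revealing_edge, revealed_edge)` pairs; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def observing_paths(e, paths, reveals):
    """O_t(e): paths containing at least one edge that reveals e."""
    return [p for p in paths if any((e2, e) in reveals for e2 in p)]

def observed_edges(p, edges, reveals):
    """O_t(p): edges revealed by at least one edge of path p."""
    return [e for e in edges if any((e2, e) in reveals for e2 in p)]

def is_symmetric(reveals):
    """True if e -> e' always comes with e' -> e."""
    return all((b, a) in reveals for (a, b) in reveals)

def satisfies_a0(paths, edges, reveals):
    """True if no two distinct edges of a same path reveal a common edge."""
    for p in paths:
        for e1, e2 in combinations(p, 2):
            if any((e1, e) in reveals and (e2, e) in reveals for e in edges):
                return False
    return True

# Toy instance (hypothetical): two disjoint paths over four edges.
edges = [0, 1, 2, 3]
paths = [(0, 2), (1, 3)]
# Self-loops (semi-bandit feedback) plus mutual revelation 0 <-> 1 and 2 <-> 3.
reveals = {(e, e) for e in edges} | {(0, 1), (1, 0), (2, 3), (3, 2)}
```

On this toy instance both paths belong to `observing_paths(0, paths, reveals)`, because edge 1 reveals edge 0, and the graph is symmetric and satisfies the assumption; adding the pair `(2, 0)` breaks both properties, since edges 0 and 2 lie on the same path.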

## 3 Exp3-OE - An Efficient Algorithm for the SOPPP

In this section, we present a new algorithm for SOPPP, called Exp3-OE (OE stands for Observable Edges), whose pseudo-code is given by Algorithm 1. The guarantees on the expected regret of Exp3-OE in SOPPP are analyzed in Section 3.2. More importantly, Exp3-OE always runs efficiently, in time polynomial in the number of edges of $G$; this is discussed in Section 3.1.

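As an Exp3-type algorithm, Exp3-OE relies on the average weights sampling where, at stage $t$, we update the weight $w_t(e)$ on each edge $e$ by the exponential rule (see Algorithm 1). For each path $\mathbf{p}$, we denote the path weight $w_t(\mathbf{p}) := \prod_{e \in \mathbf{p}} w_t(e)$ and define the following normalized terms, according to which a path is sampled at each stage of the Exp3-OE algorithm:

$$d_t(\mathbf{p}) := \frac{w_t(\mathbf{p})}{\sum_{\mathbf{p}' \in \mathcal{P}} w_t(\mathbf{p}')}, \quad \forall \mathbf{p} \in \mathcal{P}. \qquad (1)$$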

Compared to other instances of the Exp3-type algorithms, Exp3-OE has two major differences. First, at each stage $t$, the loss of each edge $e$ is estimated by $\hat{\ell}_t(e)$ (see Algorithm 1) based on the term $q_t(e) := \sum_{\mathbf{p} \in O_t(e)} d_t(\mathbf{p})$ and a parameter $\beta$. Intuitively, $q_t(e)$ is the probability that the loss on the edge $e$ is revealed from playing the chosen path at $t$. On the other hand, the implicit exploration parameter $\beta$ added to the denominator allows us to "pretend to explore" in Exp3-OE without knowing the observation graph $G_t^O$ before making the decision at stage $t$ (the uninformed setting). Unlike the standard Exp3 algorithm, the loss estimator used in Exp3-OE is biased, that is,

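$$\mathbb{E}_t\big[\hat{\ell}_t(e)\big] = \sum_{\tilde{\mathbf{p}} \in \mathcal{P}} d_t(\tilde{\mathbf{p}}) \frac{\ell_t(e)}{q_t(e) + \beta} \mathbb{I}_{\{e \in O_t(\tilde{\mathbf{p}})\}} = \frac{\sum_{\tilde{\mathbf{p}} \in O_t(e)} d_t(\tilde{\mathbf{p}})}{\sum_{\mathbf{p} \in O_t(e)} d_t(\mathbf{p}) + \beta}\, \ell_t(e) \le \ell_t(e), \quad \forall e \in \mathcal{E}. \qquad (2)$$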

Here, $\mathbb{E}_t$ denotes the expectation w.r.t. the randomness of choosing a path at stage $t$. Second, unlike standard Exp3 algorithms that keep track of and update the weight of each path, the weight pushing technique is applied both when sampling a path (via Algorithm 4 in Appendix A) and when computing $q_t(e)$ (via Algorithm 2 in Section 3.1), where we work with edge weights instead of path weights (recall that $w_t(\mathbf{p}) = \prod_{e \in \mathbf{p}} w_t(e)$).
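To make the roles of the path distribution, the revelation probability, and the biased estimator concrete, here is a deliberately inefficient sketch that enumerates all paths. It is an illustration under stated assumptions, not the paper's Algorithm 1: the multiplicative update `w[e] *= exp(-eta * estimate)` is assumed to be the exponential rule referred to above, and `loss_fn` and `reveal_fn` are hypothetical callbacks supplying the adversary's losses and the observation graph at each stage.

```python
import math
import random

def exp3_oe_bruteforce(paths, num_edges, loss_fn, reveal_fn, T, eta, beta, seed=0):
    """Brute-force illustration of an Exp3-OE-style update (enumerates all paths).

    paths     : list of tuples of edge indices
    loss_fn   : t -> list of edge losses in [0, 1]              (hypothetical)
    reveal_fn : t -> set of (e_from, e_to) pairs with self-loops (hypothetical)
    """
    rng = random.Random(seed)
    w = [1.0] * num_edges
    total_loss = 0.0
    for t in range(T):
        # Path distribution: proportional to the product of edge weights.
        pw = [math.prod(w[e] for e in p) for p in paths]
        z = sum(pw)
        d = [x / z for x in pw]
        chosen = rng.choices(paths, weights=d, k=1)[0]

        losses = loss_fn(t)
        reveals = reveal_fn(t)  # consulted only after the path is chosen
        total_loss += sum(losses[e] for e in chosen)

        for e in range(num_edges):
            # Probability that the loss of e is revealed by the sampled path.
            q = sum(dp for p, dp in zip(paths, d)
                    if any((e2, e) in reveals for e2 in p))
            observed = any((e2, e) in reveals for e2 in chosen)
            estimate = losses[e] / (q + beta) if observed else 0.0
            # Assumed exponential-weights rule on edges.
            w[e] *= math.exp(-eta * estimate)
    return w, total_loss
```

Enumerating paths costs time proportional to the number of paths, which is exactly what the weight pushing technique is meant to avoid; the sketch is only useful for checking small instances by hand.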

### 3.1 Running Time Efficiency of the Exp3-OE Algorithm

We recall that in order to efficiently sample a path according to $d_t$, following the literature, it is useful to compute the terms $H_t(s, u)$ and $H_t(u, d)$ for any vertex $u$ in $G$. Intuitively, $H_t(u, v)$ is the aggregate weight of all paths from vertex $u$ to vertex $v$ at stage $t$. Then, a path in $\mathcal{P}$ is sampled sequentially, edge by edge, based on these terms. The collection of the computations described above is often referred to as weight pushing, which can be done in $\mathcal{O}(E)$ time by exploiting the structure of the graph. We state this step formally in Appendix A.
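The edge-by-edge sampling described above can be sketched as follows. This is a generic weight-pushing routine written for illustration, not a transcription of the paper's Algorithm 4; it assumes the DAG is given as a list of `(tail, head)` pairs together with a topological order of its vertices, and the names `suffix_weights` and `sample_path` are hypothetical.

```python
import random
from collections import defaultdict

def suffix_weights(edges, w, topo_order, dest):
    """H[u] = total weight of all paths from u to dest, where the weight of a
    path is the product of its edge weights (reverse topological sweep)."""
    out = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        out[u].append((idx, v))
    H = {u: 0.0 for u in topo_order}
    H[dest] = 1.0
    for u in reversed(topo_order):
        if u != dest:
            H[u] = sum(w[idx] * H[v] for idx, v in out[u])
    return H, out

def sample_path(edges, w, topo_order, source, dest, rng):
    """Sample a source-dest path with probability proportional to its weight."""
    H, out = suffix_weights(edges, w, topo_order, dest)
    path, u = [], source
    while u != dest:
        # Successors that cannot reach dest get zero probability.
        cand = [(idx, v) for idx, v in out[u] if H[v] > 0]
        probs = [w[idx] * H[v] for idx, v in cand]
        idx, u = rng.choices(cand, weights=probs, k=1)[0]
        path.append(idx)
    return path
```

Choosing edge `(u, v)` with probability `w * H[v] / H[u]` at each step makes the product of the step probabilities telescope to the path weight divided by `H[source]`, which is the normalized path weight.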

The final non-trivial step to efficiently implement Exp3-OE is to compute $q_t(e)$, the probability that an edge $e$ is revealed at stage $t$. We note that $q_t(e)$ is the sum of $|O_t(e)|$ terms; therefore, a direct computation is inefficient, while a naive application of the weight pushing technique can easily lead to errors. To compute $q_t(e)$, we propose Algorithm 2, a non-straightforward application of weight pushing, in which we consecutively consider all the edges $e'$ that reveal $e$. Then, we take the sum of the terms $d_t(\mathbf{p})$ of the paths $\mathbf{p}$ going through $e'$ by the weight pushing technique while making sure that each of these terms is included only once, even if $\mathbf{p}$ has more than one edge revealing $e$ (this is a non-trivial step). In Algorithm 2, we denote by $C(u)$ the set of the direct successors of any vertex $u$. A proof that Algorithm 2 outputs exactly $q_t(e)$ as defined in Algorithm 1 can be found in Appendix B. Algorithm 2 runs in time polynomial in the number of edges; in conclusion, the Exp3-OE algorithm runs in polynomial time, and this guarantee holds even in the worst-case scenario. For comparison, the running time of FPL-IX proposed by kocak14 is bounded only in expectation if we choose Dijkstra's algorithm as the optimization oracle at each stage; with the parameters chosen in kocak14 , the corresponding bound, stated in the $\tilde{\mathcal{O}}$ notation (a version of the big-O asymptotic notation that ignores the logarithmic terms), holds only with probability at least $1 - \delta$ for an arbitrary $\delta > 0$. That is, FPL-IX is not guaranteed to have an efficient running time in all cases.
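One simple way to see why the revelation probability is computable without enumerating paths is to take the complement: the paths that do not reveal an edge are exactly the paths that avoid every edge revealing it, and their total weight is again a weight-pushing computation on a pruned graph. The sketch below uses this complement idea; it is an alternative shown for illustration and is not the paper's Algorithm 2. The function names and the `(tail, head)` edge representation are assumptions, and the inner scan over all edges is written for clarity rather than speed.

```python
def total_path_weight(edges, w, topo_order, source, dest, banned=frozenset()):
    """Sum over all source-dest paths of the product of edge weights,
    ignoring every edge whose index is in `banned`."""
    H = {u: 0.0 for u in topo_order}
    H[dest] = 1.0
    for u in reversed(topo_order):
        if u == dest:
            continue
        H[u] = sum(w[i] * H[head] for i, (tail, head) in enumerate(edges)
                   if tail == u and i not in banned)
    return H[source]

def reveal_probability(e, edges, w, topo_order, source, dest, reveals):
    """1 - (weight of paths with no edge revealing e) / (total path weight)."""
    revealing = frozenset(i for i in range(len(edges)) if (i, e) in reveals)
    total = total_path_weight(edges, w, topo_order, source, dest)
    hidden = total_path_weight(edges, w, topo_order, source, dest, banned=revealing)
    return 1.0 - hidden / total
```

Because each path is either counted in `hidden` or not, a path with several edges revealing the target edge is never counted twice, which is the pitfall mentioned above for a naive per-edge summation.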

### 3.2 Performance of the Exp3-OE Algorithm

In this section, we present an upper-bound of the expected regret achieved by the Exp3-OE algorithm in the SOPPP. For the sake of brevity, with $d_t(\mathbf{p})$ defined in (1), for any $t \in [T]$ and $e \in \mathcal{E}$, we denote:

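$$r_t(e) := \sum_{\mathbf{p} \ni e} d_t(\mathbf{p}) \quad \text{and} \quad Q_t := \sum_{e \in \mathcal{E}} \frac{r_t(e)}{q_t(e) + \beta}. \qquad (3)$$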

Intuitively, $r_t(e)$ is the probability that the chosen path at stage $t$ contains an edge $e$, and $Q_t$ is the summation over all the edges of the ratio of this quantity and the probability that the loss of the edge is revealed (plus $\beta$). We can bound the expected regret with this key term $Q_t$.
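As a quick sanity check on the definitions in (3), the two extreme observation models of Section 2 can be worked out by hand. The derivation below is a sketch added for illustration; it writes $\mathcal{E}$ for the edge set and $E$ for its size, and it uses only the definition of the revelation probability as a sum of path probabilities and the fact that every path has at most $n$ edges.

```latex
% Semi-bandit feedback only (self-loops only):
% O_t(e) = \{p : p \ni e\}, hence q_t(e) = r_t(e).
\[
Q_t \;=\; \sum_{e \in \mathcal{E}} \frac{r_t(e)}{r_t(e) + \beta} \;\le\; E .
\]
% Full information (complete observation graphs):
% every path reveals every edge, hence q_t(e) = 1.
\[
Q_t \;=\; \frac{1}{1+\beta} \sum_{e \in \mathcal{E}} r_t(e)
    \;=\; \frac{1}{1+\beta} \sum_{\mathbf{p} \in \mathcal{P}} d_t(\mathbf{p}) \, \|\mathbf{p}\|_1
    \;\le\; \frac{n}{1+\beta} .
\]
```

Under these assumptions the key term is at most the number of edges when there are no side-observations and at most $n/(1+\beta)$ with full information, which gives a rough sense of how much richer observation graphs can help.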

###### Theorem 3.1.

The expected regret of the Exp3-OE algorithm in the SOPPP satisfies:

$$R_T \le \frac{\ln(|\mathcal{P}|)}{\eta} + \left[\beta + \frac{n \cdot \eta}{2}\right] \cdot \sum_{t \in [T]} Q_t. \qquad (4)$$

The proof of Theorem 3.1 is given in Appendix C and follows an approach similar to alon13 ; cesa2012 with several necessary adjustments to handle the new biased loss estimator in Exp3-OE. To see the relationship between the structure of the side-observations of the learner and the bound of the expected regret, we look for upper-bounds of $Q_t$ in terms of the observation graphs' parameters. Let $\alpha_t$ be the independence number of $G_t^O$ (the independence number of a directed graph is computed while ignoring the direction of the edges); we have the following statement.

###### Theorem 3.2.

Let us denote , and , Upper-bounds of in different cases of  are given in the following table:

A proof of this theorem is given in Appendix E. The main idea of this proof is based on several graph-theoretical lemmas that are extracted from alon13 ; kocak14 ; mannor11 . These lemmas establish the relationship between the independence number of a graph and the ratios of the weights on the graph's vertices that have similar forms to the key term $Q_t$. The case where observation graphs are non-symmetric and do not satisfy assumption (A0) is the most general setting. Moreover, as shown in Theorem 3.2, the bounds of $Q_t$ are improved if the observation graphs satisfy either the symmetry condition or assumption (A0). Intuitively, given the same independence numbers, a symmetric observation graph gives the learner more information than a non-symmetric one; thus, it may yield a better bound on $Q_t$ and the expected regret. On the other hand, assumption (A0) is a technical assumption that allows the use of different techniques in the proofs to obtain better bounds. These cases have not been analyzed in the literature, although they are satisfied by several practical situations, including the CB and HS games (see Section 4).

Finally, we give results on the order of the upper-bounds of the expected regret, obtained by the Exp3-OE algorithm, presented as a corollary of Theorems 3.1 and 3.2.

###### Corollary 3.3.

In SOPPP, let $\alpha$ be an upper bound of $\alpha_t$, $\forall t \in [T]$. With appropriate choices of the parameters $\eta$ and $\beta$, the expected regret of the Exp3-OE algorithm is:

• in the general cases.

• if assumption  is satisfied by the observation graphs .

The choices of the parameters $\beta$ and $\eta$ (which are non-trivial in the cases where the observation graphs are non-symmetric) that yield these results are given in Appendix F. We also note that a trivial upper-bound of $\alpha_t$ is the number of vertices of the graph $G_t^O$, which is $E$ (the number of edges in $G$). In general, the more connected $G_t^O$ is, the smaller $\alpha$ may be chosen, and thus the better the upper-bound of the expected regret. In the (classical) semi-bandit setting, $\alpha_t = E$, and in the full-information setting, $\alpha_t = 1$. Finally, we also note that, under a condition on the problem parameters that is typical in practice (including in the CB and HS games), the bound in Corollary 3.3-(i) matches in order the bounds (ignoring the logarithmic factors) given by the FPL-IX algorithm (see kocak14 ). On the other hand, the form of the regret bound provided by the Exp3-IX algorithm (see kocak14 ) does not allow us to compare directly with the bound of Exp3-OE in the general SOPPP. In kocak14 , Exp3-IX is only analyzed when $n = 1$ (all paths have length 1); in this case, we observe that the bound given by our Exp3-OE algorithm is better than that of Exp3-IX (by some multiplicative constants).

## 4 Colonel Blotto Games and Hide-and-Seek Games as SOPPP

Given the regret analysis of Exp3-OE in SOPPP, we now return to our main motivation, the Colonel Blotto and the Hide-and-Seek games, and discuss how to apply our findings to these games. To this end, we formally define the online versions of the games and show how these problems can be formulated as SOPPP in Sections 4.1 and 4.2; we then demonstrate the benefit of using the Exp3-OE algorithm for learning in these games (Section 4.3).

### 4.1 Colonel Blotto Games as an SOPPP

The online Colonel Blotto game. This is a game between a learner and an adversary over $n$ battlefields within a time horizon $T$. Each battlefield $i \in [n]$ has a value $b_t(i) > 0$ (unknown to the learner) at stage $t$ such that $\sum_{i \in [n]} b_t(i) = 1$. At stage $t$, the learner needs to distribute $k$ troops ($k$ is fixed) across the battlefields while the adversary simultaneously allocates hers. The learner's strategy set is $S_{k,n} := \{\mathbf{z} \in \mathbb{N}^n : \sum_{i \in [n]} z(i) = k\}$. At stage $t$ and battlefield $i$, if the adversary's allocation is strictly larger than the learner's allocation, the learner loses this battlefield and she suffers the loss $b_t(i)$; if they have tied allocations, she suffers the loss $b_t(i)/2$; otherwise, she wins and suffers no loss. At the end of stage $t$, the learner observes the loss from each battlefield (and which battlefields she wins, ties, or loses) but not the adversary's allocations. The learner's loss at each time is the sum of the losses from all the battlefields. The objective of the learner is then to minimize her cumulative loss over the time horizon.

While this problem can be formulated as a standard OComb, it is difficult to derive an efficient learning algorithm under that formulation, due to the exponentially large set of strategies from which the learner can choose at each stage. Instead, we show that by reformulating the problem as an SOPPP, we can exploit the advantages of the Exp3-OE algorithm to solve it. To do so, first note that the learner can deduce several side-observations as follows: (i) if she allocates $z$ troops to battlefield $i$ and wins, she knows that if she had allocated more than $z$ troops to $i$, she would also have won; (ii) if she knows that the allocations are tied at battlefield $i$, she knows exactly the adversary's allocation to this battlefield and can deduce all the losses she might have suffered if she had allocated differently to battlefield $i$; (iii) if she allocates $z$ troops to battlefield $i$ and loses, she knows that if she had allocated fewer than $z$ troops to battlefield $i$, she would also have lost.
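The three deduction rules above translate directly into a small lookup for a single battlefield. The following sketch is illustrative only: the function name and signature are hypothetical, and the parameter `tie_fraction` encodes the assumption that a tie costs a fixed fraction (here one half) of the loss incurred when the battlefield is lost.

```python
def blotto_side_observations(a, outcome, observed_loss, k, tie_fraction=0.5):
    """Losses the learner can deduce at one battlefield.

    a             : troops the learner allocated to this battlefield (0..k)
    outcome       : "win", "tie" or "lose", as observed by the learner
    observed_loss : the loss she actually suffered at this battlefield
    k             : learner's total number of troops
    tie_fraction  : assumed ratio between the tie loss and the losing loss
    Returns {allocation: loss} for every allocation whose loss is revealed.
    """
    if outcome == "win":
        # Any allocation at least as large as a would also have won.
        return {x: 0.0 for x in range(a, k + 1)}
    if outcome == "lose":
        # Any allocation no larger than a would also have lost.
        return {x: observed_loss for x in range(0, a + 1)}
    if outcome == "tie":
        # The adversary allocated exactly a troops, so every outcome is known.
        losing_loss = observed_loss / tie_fraction
        revealed = {x: losing_loss for x in range(0, a)}
        revealed[a] = observed_loss
        revealed.update({x: 0.0 for x in range(a + 1, k + 1)})
        return revealed
    raise ValueError(f"unknown outcome: {outcome}")
```

The asymmetry between the "win" and "lose" cases is what makes the resulting observation graphs non-symmetric: winning with a small allocation reveals the outcome of larger ones, but not the other way around.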

Now, to cast the CB game as SOPPP, for each instance of the parameters $k$ and $n$, we create a DAG $G_{k,n}$ such that the strategy set $S_{k,n}$ has a one-to-one correspondence to the paths set of $G_{k,n}$. The formal definition of $G_{k,n}$ is given in Appendix G; due to the lack of space, we only present here an example illustrating the graph of an instance of the CB game in Figure 3-(a). The number of edges of $G_{k,n}$ is polynomial in $k$ and $n$, whereas its number of paths is exponential, and the length of every path is $n$. Each edge in $G_{k,n}$ corresponds to allocating a certain number of troops to a battlefield. Therefore, the CB game model is equivalent to a PPP where at each stage the learner chooses a path in $G_{k,n}$ and the loss on each edge is generated from the allocations of the adversary and the learner (corresponding to that edge) according to the rules of the game. At stage $t$, the (semi-bandit) feedback and the side-observations deduced by the learner as described above induce an observation graph $G_t^O$. (For example, if the learner chooses a path going through an edge that corresponds to allocating a given number of troops to a battlefield and wins, so that the loss on that edge is 0, then she deduces that the losses on the edges corresponding to allocating at least that many troops to the same battlefield are all 0.) This formulation indeed transforms any CB game into an SOPPP.

Note that since there are edges in $G_{k,n}$ that refer to the same allocation (i.e., to allocating the same number of troops to the same battlefield), in the observation graphs, the vertices corresponding to these edges are always connected. Therefore, since there are $k+1$ possible allocations to each of the $n$ battlefields, an upper bound of the independence number of $G_t^O$ in the CB game is $n(k+1)$. Moreover, we can verify that the observation graph of the CB game satisfies assumption (A0) for any $t$ and that it is non-symmetric.
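For intuition about the path-planning view, the following sketch builds one possible layered DAG whose source-destination paths are in one-to-one correspondence with the allocations of all $k$ troops over $n$ battlefields. The paper's formal construction is deferred to its Appendix G and may differ in details; the vertex encoding `(i, j)` and the function names below are assumptions of this illustration.

```python
from math import comb

def blotto_dag(n, k):
    """Layered DAG for n battlefields and k troops (all troops must be used).

    Vertex (i, j): the first i battlefields have received j troops in total.
    Edge (i, j) -> (i + 1, j2): allocate j2 - j troops to battlefield i + 1.
    """
    edges = []
    for i in range(n):
        # Only (0, 0) exists in layer 0; later layers may hold 0..k troops.
        starts = [0] if i == 0 else range(k + 1)
        for j in starts:
            # The last battlefield must absorb all remaining troops.
            ends = [k] if i == n - 1 else range(j, k + 1)
            for j2 in ends:
                edges.append(((i, j), (i + 1, j2)))
    return edges

def count_paths(edges, source, dest):
    """Count source-dest paths by memoized dynamic programming."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    memo = {dest: 1}

    def go(u):
        if u not in memo:
            memo[u] = sum(go(v) for v in succ.get(u, []))
        return memo[u]

    return go(source)
```

In this construction every path has exactly $n$ edges, the number of edges grows quadratically in $k$ and linearly in $n$, and the number of paths equals the number of ways to split $k$ troops over $n$ battlefields, `comb(n + k - 1, n - 1)`.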

### 4.2 Hide-and-Seek Games as an SOPPP

The online Hide-and-Seek game. This is a repeated game (within the time horizon $T$) between a hider and a seeker. In this work, we consider that the learner plays the role of the seeker and the hider is the adversary. There are $k$ locations, indexed from $1$ to $k$. At each stage $t \in [T]$, the learner sequentially chooses $n$ locations, called an $n$-search, to seek the hider, that is, she chooses $\mathbf{z}_t \in [k]^n$ (if $z_t(i) = j$, we say that location $j$ is her $i$-th move). The hider maliciously assigns losses to all $k$ locations (intuitively, these losses can be the time wasted supervising a mismatched location, or the probability that the hider does not hide there, etc.). In this work, we consider the following condition on how the hider/adversary assigns the losses to the locations.

• (C1): At stage $t$, the adversary secretly assigns a loss $b_t(j)$ to each location $j \in [k]$ (unknown to the learner). These losses are fixed throughout the $n$-search of the learner.

The learner's loss at stage $t$ is the sum of the losses from her chosen locations in the $n$-search at stage $t$, that is, $\sum_{i \in [n]} b_t(z_t(i))$. Moreover, in practice the $n$-search of the learner often needs to satisfy some constraints. In this work, as an example, we use the following constraint: $|z_t(i) - z_t(i+1)| \le \kappa$ for all $i \in [n-1]$, for a fixed $\kappa \in [k]$ (called the coherence constraint), i.e., the seeker cannot search too far away from her previously chosen location. (Our results can be applied to HS games with other constraints, such as $z_t(i) \le z_t(i+1)$, i.e., she can only search forward, or a cap on the number of times she can search the same location.) At the end of stage $t$, the learner only observes the losses from the locations she chose in her $n$-search, and her objective is to minimize her total loss over the $T$ stages.

Similar to the case of the CB game, tackling the HS game as a standard OComb is computationally involved. As such, we follow the SOPPP formulation instead. In particular, knowing that the adversary follows condition (C1), the learner can deduce the following side-observations: within a stage, the loss at each location remains the same no matter when it is chosen in the $n$-search; that is, knowing the loss of choosing location $j$ as her $i$-th move, the learner knows the loss she would suffer if she chose location $j$ as her $i'$-th move for any $i' \neq i$. Given this, we create a DAG $G_{k,\kappa,n}$ whose paths set has a one-to-one correspondence to the set containing all feasible $n$-searches of the learner in the HS game with $k$ locations under the $\kappa$-coherence constraint. A formal definition of $G_{k,\kappa,n}$ is given in Appendix G. The HS game is equivalent to the PPP where the learner chooses a path in $G_{k,\kappa,n}$ and the edges' losses are generated by the adversary at each stage (note that to ensure that all paths end at $d$, there are auxiliary edges in $G_{k,\kappa,n}$ that are always embedded with $0$ losses). Figure 3-(b) illustrates the corresponding graph of an instance of the HS game. As in the CB game, the number of edges of $G_{k,\kappa,n}$ is polynomial in the game's parameters, whereas its number of paths is exponential.
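The correspondence between feasible searches and paths can be illustrated by counting the feasible $n$-searches with a dynamic program over moves, which mirrors counting paths in a layered DAG with one layer per move. This is a sketch under assumptions: the coherence constraint is taken to mean that consecutive moves differ by at most `kappa`, and both function names are hypothetical.

```python
def count_coherent_searches(k, n, kappa):
    """Number of n-searches over locations 1..k in which consecutive moves
    differ by at most kappa (the assumed form of the coherence constraint)."""
    ways = {loc: 1 for loc in range(1, k + 1)}  # first move is unconstrained
    for _ in range(n - 1):
        ways = {
            loc: sum(c for prev, c in ways.items() if abs(prev - loc) <= kappa)
            for loc in range(1, k + 1)
        }
    return sum(ways.values())

def side_observed_moves(location, n):
    """Under the fixed-loss condition, observing the loss of a location at one
    move reveals the loss of choosing that location at every move 1..n."""
    return [(move, location) for move in range(1, n + 1)]
```

With 4 locations, 3 moves and `kappa = 1` there are 26 feasible searches, while the dynamic program only touches a number of (move, location) pairs that is linear in the number of moves, which is the same gap between path count and graph size that the path-planning formulation exploits.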

The semi-bandit feedback and side-observations as described above generate an observation graph $G_t^O$ at time $t$ (e.g., in Figure 3-(b), the edges that represent choosing the same location at different moves mutually reveal each other). The independence number of $G_t^O$ is $k$ for any $t$. We note that the observation graphs of the HS game are symmetric and do not satisfy assumption (A0). Finally, we consider a relaxation of condition (C1):

• (C2): At stage $t$, the adversary assigns a loss $b_t(j)$ to each location $j \in [k]$. For $i \in [n]$, after the learner chooses, say, location $j$ as her $i$-th move, the adversary can observe that and change the losses of any location that has not been searched before by the learner, i.e., she can change the losses $b_t(j')$ of the locations $j'$ that are not among the learner's first $i$ moves. (An interpretation is that by searching a location, the learner/seeker "discovers and secures" that location; therefore, the adversary/hider cannot change her assigned loss at that place.)

By replacing condition (C1) with condition (C2), we limit the side-observations of the learner: she can only deduce that, if $i < i'$, the edges in $G_{k,\kappa,n}$ representing choosing a location as the $i$-th move reveal the edges representing choosing that same location as the $i'$-th move, but not vice versa. In this case, the observation graph only contains directed edges; however, its independence number is still $k$, as in the HS games with condition (C1).

### 4.3 Performance of Exp3-OE in the Colonel Blotto and Hide-and-Seek Games

Having formulated the CB game and the HS game as SOPPPs, we can use the Exp3-OE algorithm to achieve the following results (deduced directly from Corollary 3.3).

###### Corollary 4.1.

The expected regret of the Exp3-OE algorithm satisfies:

• in the CB games with troops and battlefields.

• in the HS games with locations and -search.

At a high level, given the same scale on their inputs, the independence numbers of the observation graphs in HS games are smaller than in CB games (by a multiplicative factor of $n$). However, since assumption (A0) is satisfied by the observation graphs of the CB games and not by those of the HS games, the expected regret bounds of the Exp3-OE algorithm in these games have the same order of magnitude. From Corollary 4.1, we note that in the CB games, the order of the regret bounds given by Exp3-OE is better than that of the FPL-IX algorithm (thanks to the fact that (A0) is satisfied). On the other hand, in the HS games with condition (C1), which involve symmetric observation graphs, the regret bounds of the Exp3-OE algorithm improve on the bound of FPL-IX, but they are still of the same order in the games' parameters (ignoring the logarithmic factors). Finally, we compare the regret guarantees given by our Exp3-OE algorithm and by the Online Stochastic Mirror Descent algorithm (henceforth, OSMD; see AudibertBL2014 ), the benchmark algorithm for OComb with semi-bandit feedback (although OSMD does not run efficiently in general). When OSMD is applied to the CB and HS games (as SOPPP), the side-observations are ignored. Using the parameters $\eta$ and $\beta$ chosen for Corollaries 3.3 and 4.1 (see Appendix F) in the corresponding cases of the observation graphs, the Exp3-OE algorithm provides a better upper-bound of the expected regret than OSMD under explicit conditions on the parameters of the CB games, of the HS games with condition (C1), and of the HS games with condition (C2). A proof of this statement is given in Appendix H.

## 5 Conclusion

In this work, we introduce the Exp3-OE algorithm for the path planning problem with semi-bandit feedback and side-observations. Exp3-OE is always efficiently implementable, and its regret guarantees match those of the FPL-IX algorithm. We apply our findings to derive the first solutions to the online versions of the Colonel Blotto and Hide-and-Seek games. This work also extends the scope of application of the PPP model in practice, even for large instances.

## References

• (1) Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In JMLR Workshop and Conference Proceedings, volume 40. Microtome Publishing, 2015.
• (2) Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618, 2013.
• (3) Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
• (4) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In FOCS, page 322. IEEE, 1995.
• (5) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
• (6) Soheil Behnezhad, Sina Dehghani, Mahsa Derakhshan, MohammadTaghi HajiAghayi, and Saeed Seddighin. Faster and simpler algorithm for optimal strategies of Blotto game. In AAAI, pages 369–375, 2017.
• (7) Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.
• (8) Sourabh Bhattacharya, Tamer Başar, and Maurizio Falcone. Surveillance for security as a pursuit-evasion game. In International Conference on Decision and Game Theory for Security, pages 370–379. Springer, 2014.
• (9) Sourabh Bhattacharya and Seth Hutchinson. On the existence of nash equilibrium for a two player pursuit-evasion game with visibility constraints. In Algorithmic Foundation of Robotics VIII, pages 251–265. Springer, 2009.
• (10) Jeremiah Blocki, Nicolas Christin, Anupam Datta, Ariel D Procaccia, and Arunesh Sinha. Audit games. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
• (11) Emile Borel. La théorie du jeu et les équations intégrales à noyau symétrique. Comptes rendus de l’Académie des Sciences, 173(1304-1308):58, 1921.
• (12) Joseph L Bower and Clark G Gilbert. From resource allocation to strategy. Oxford University Press, 2005.
• (13) Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
• (14) Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.
• (15) Pern Hui Chia. Colonel Blotto in web security. In The Eleventh Workshop on Economics and Information Security, WEIS Rump Session, pages 141–150, 2012.
• (16) Timothy H Chung, Geoffrey A Hollinger, and Volkan Isler. Search and pursuit-evasion in mobile robotics. Autonomous robots, 31(4):299, 2011.
• (17) Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
• (18) Oliver Gross and Robert Wagner. A continuous Colonel Blotto game. U.S.Air Force Project RAND Research Memorandum, 1950.
• (19) JD Grote. The theory and application of differential games. Springer, 1975.
• (20) András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007.
• (21) Joao P Hespanha, Maria Prandini, and Shankar Sastry. Probabilistic pursuit-evasion games: A one-step nash approach. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No. 00CH37187), volume 3, pages 2272–2277. IEEE, 2000.
• (22) Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
• (23) Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.
• (24) Dmytro Korzhyk, Vincent Conitzer, and Ronald Parr. Complexity of computing optimal stackelberg strategies in security resource allocation games. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
• (25) Dan Kovenock and Brian Roberson. Coalitional Colonel Blotto games with application to the economics of alliances. Journal of Public Economic Theory, 14(4):653–676, 2012.
• (26) Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
• (27) Antonia Maria Masucci and Alonso Silva. Strategic resource allocation for competitive influence in social networks. In Allerton, pages 951–958, 2014.
• (28) Antonia Maria Masucci and Alonso Silva. Defensive resource allocation in social networks. In CDC, pages 2927–2932, 2015.
• (29) Vishnu Navda, Aniruddha Bohra, Samrat Ganguly, and Dan Rubenstein. Using channel hopping to increase 802.11 resilience to jamming attacks. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications. IEEE, pages 2526–2530. IEEE, 2007.
• (30) Brian Roberson. The Colonel Blotto game. Economic Theory, 29(1):2–24, 2006.
• (31) Shinsaku Sakaue, Masakazu Ishihata, and Shin-ichi Minato. Efficient bandit combinatorial optimization algorithm with zero-suppressed binary decision diagrams. In International Conference on Artificial Intelligence and Statistics, pages 585–594, 2018.
• (32) Galina Schwartz, Patrick Loiseau, and Shankar S Sastry. The heterogeneous Colonel Blotto game. In NetGCoop, pages 232–238, 2014.
• (33) Eiji Takimoto and Manfred K Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4(Oct):773–818, 2003.
• (34) Rene Vidal, Omid Shakernia, H Jin Kim, David Hyunchul Shim, and Shankar Sastry. Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation. IEEE transactions on robotics and automation, 18(5):662–669, 2002.
• (35) John Von Neumann. A certain zero-sum two-person game equivalent to the optimal assignment problem. Contributions to the Theory of Games, 2:5–12, 1953.
• (36) Dong Quan Vu, Patrick Loiseau, and Alonso Silva. Efficient computation of approximate equilibria in discrete Colonel Blotto games. In IJCAI-ECAI, July 2018.
• (37) Qingsi Wang and Mingyan Liu. Learning in hide-and-seek. IEEE/ACM Transactions on Networking, 24(2):1279–1292, 2016.
• (38) Wenyuan Xu, Wade Trappe, Yanyong Zhang, and Timothy Wood. The feasibility of launching and detecting jamming attacks in wireless networks. In Proceedings of the 6th ACM international symposium on Mobile ad hoc networking and computing, pages 46–57. ACM, 2005.
• (39) Yaakov Yavin. Pursuit–evasion differential games with deception or interrupted observation. In Pursuit-Evasion Differential Games, pages 191–203. Elsevier, 1987.
• (40) Chongjie Zhang, Victor Lesser, and Prashant Shenoy. A multi-agent learning approach to online distributed resource allocation. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.

## Appendix A Weight Pushing for Path Sampling

We revisit some useful results in the literature. In this section, we consider a DAG with parameters as introduced in Section 2. For simplicity, we assume that each edge in belongs to at least one path in . Let us respectively denote by and the set of the direct successors and the set of the direct predecessors of any vertex . Moreover, let and respectively denote the edge and the set of all paths from vertex to vertex . Let us consider a weight for each edge . The Exp3-OE algorithm needs to sample a path with the probability:

$$d(\tilde{p}) := \frac{\prod_{e \in \tilde{p}} w(e)}{\sum_{p \in \mathcal{P}} \prod_{e \in p} w(e)}. \tag{5}$$

A direct computation and sampling from takes time, which is very inefficient. To sample the path efficiently, we first label the vertex set by such that if there exists an edge connecting to then . We then define the following terms for each vertex :

$$H(s,u) := \sum_{p \in \mathcal{P}_{s,u}} \prod_{e \in p} w(e) \quad \text{and} \quad H(u,d) := \sum_{p \in \mathcal{P}_{u,d}} \prod_{e \in p} w(e).$$

Intuitively, is the aggregate weight of all paths from vertex to vertex  and is exactly the denominator in (5). These terms and can be recursively computed by Algorithm 3 that runs in time, through dynamic programming. This technique is called weight pushing and can be found in [20, 31, 33].

Based on Algorithm 3, we construct Algorithm 4 that uses the weights as inputs and randomly outputs a path in . Intuitively, starting from the root vertex , Algorithm 4 samples the vertices of the path one at a time, based on the terms computed by Algorithm 3. It is noteworthy that Algorithm 4 also runs in time, and it is straightforward to prove that the probability that a path is sampled by Algorithm 4 matches exactly .
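The weight-pushing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a transcription of Algorithms 3 and 4: the function names `push_weights` and `sample_path`, the dictionary-based edge representation, and the convention that vertices are labelled `0..n-1` in topological order (source `0`, destination `n-1`) are all choices made for this example.

```python
import random
from collections import defaultdict


def push_weights(n, edges):
    """Aggregate path weights on a DAG by dynamic programming.

    Assumptions of this sketch: vertices are 0..n-1, every edge (u, v)
    satisfies u < v, vertex 0 is the source and n-1 the destination.
    `edges` maps (u, v) to a positive weight w(e).
    Returns (H_src, H_dst, succ) where H_src[u] is the total weight of all
    paths 0 -> u and H_dst[u] is the total weight of all paths u -> n-1.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for (u, v) in edges:
        succ[u].append(v)
        pred[v].append(u)
    H_src = [0.0] * n
    H_dst = [0.0] * n
    H_src[0] = 1.0       # the empty path from the source to itself
    H_dst[n - 1] = 1.0   # the empty path from the destination to itself
    for v in range(1, n):                # forward pass
        H_src[v] = sum(H_src[u] * edges[(u, v)] for u in pred[v])
    for u in range(n - 2, -1, -1):       # backward pass
        H_dst[u] = sum(edges[(u, v)] * H_dst[v] for v in succ[u])
    return H_src, H_dst, succ


def sample_path(n, edges, rng):
    """Sample a source-destination path with probability proportional to
    the product of its edge weights, choosing one successor at a time."""
    _, H_dst, succ = push_weights(n, edges)
    path, u = [], 0
    while u != n - 1:
        candidates = succ[u]
        # Successor v is chosen with probability w(u, v) * H(v, d) / H(u, d).
        scores = [edges[(u, v)] * H_dst[v] for v in candidates]
        v = rng.choices(candidates, weights=scores)[0]
        path.append((u, v))
        u = v
    return path


# Small illustrative DAG with three source-destination paths.
example_edges = {(0, 1): 2.0, (0, 2): 1.0, (1, 3): 3.0, (2, 3): 4.0, (1, 2): 0.5}
example_path = sample_path(4, example_edges, random.Random(0))
```

The step-wise probabilities telescope: multiplying the factors `w(u, v) * H_dst[v] / H_dst[u]` along a path leaves the product of its edge weights divided by `H_dst[0]`, which is the denominator of (5).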

## Appendix B Proof of Algorithm 2’s Output

###### Proof.

Fixing an edge , we prove that when Algorithm 2 takes the edge weights as the input, it outputs exactly . We note that if , then .

We denote and label the edges in the set by . We let the for-loop in lines of Algorithm 2 consecutively run with the edges in as follows:

• After the for-loop runs for , we have ; therefore, since computed from the original weights . Due to line  that sets , henceforth in Algorithm 2, the weight of any path that contains is set to .

• When the for-loop runs for , we have because any path has the weight . Therefore, .

• Similarly, after the for-loop runs for (where ), we have:

 qt(e)=i∑k=1⎛⎜ ⎜ ⎜⎝∑{p∋ek}∖⋃j
• Therefore, after the for-loop finishes running for every edge in , we have where each term is counted only once, even if contains more than one edge that reveals the edge .
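The counting argument in this proof can be checked numerically. The sketch below is an assumption-laden illustration rather than Algorithm 2 itself (which is not reproduced in this section): it follows the loop as the proof describes it, adding the normalized weight of all paths through the current revealing edge and then setting that edge's weight to zero so that no path is counted twice. The names `path_sums`, `observation_probability`, and `brute_force`, and the vertex labelling `0..n-1` with source `0` and destination `n-1`, are hypothetical. For brevity the path sums are recomputed from scratch at every iteration.

```python
def path_sums(n, w):
    """Forward and backward aggregate path weights on a DAG whose vertices
    are 0..n-1 and whose edges (u, v) all satisfy u < v."""
    fwd = [0.0] * n
    bwd = [0.0] * n
    fwd[0], bwd[n - 1] = 1.0, 1.0
    for v in range(1, n):
        fwd[v] = sum(fwd[u] * wt for (u, x), wt in w.items() if x == v)
    for u in range(n - 2, -1, -1):
        bwd[u] = sum(wt * bwd[v] for (x, v), wt in w.items() if x == u)
    return fwd, bwd


def observation_probability(n, weights, revealing_edges):
    """Probability that a path drawn proportionally to its weight contains
    at least one edge of `revealing_edges`, computed edge by edge."""
    w = dict(weights)                 # work on a copy; the input is untouched
    _, bwd = path_sums(n, w)
    total = bwd[0]                    # total weight, from the original weights
    q = 0.0
    for (u, v) in revealing_edges:
        fwd, bwd = path_sums(n, w)
        # Weight of all paths through (u, v) that avoid the earlier edges.
        q += fwd[u] * w[(u, v)] * bwd[v] / total
        w[(u, v)] = 0.0               # later iterations skip these paths
    return q


def brute_force(n, weights, revealing_edges):
    """Same quantity by explicit enumeration of all paths (small graphs only)."""
    revealing = set(revealing_edges)

    def walk(u, prod, hit):
        if u == n - 1:
            return (prod if hit else 0.0), prod
        num = den = 0.0
        for (x, v), wt in weights.items():
            if x == u:
                a, b = walk(v, prod * wt, hit or (x, v) in revealing)
                num += a
                den += b
        return num, den

    num, den = walk(0, 1.0, False)
    return num / den


demo_weights = {(0, 1): 2.0, (0, 2): 1.0, (1, 3): 3.0, (2, 3): 4.0, (1, 2): 0.5}
demo_q = observation_probability(4, demo_weights, [(0, 1), (1, 2)])
```

In this small example, the path through both `(0, 1)` and `(1, 2)` contributes once, in the first iteration, and contributes nothing in the second because the weight of `(0, 1)` has already been set to zero.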

## Appendix C Proof of Theorem 3.1

See 3.1

###### Proof.

We first denote . (Footnote 9: We recall that .) From line of Algorithm 1, we trivially have:

$$w_{t+1}(p) = w_t(p) \cdot \exp\big(-\eta \hat{L}_t(p)\big), \quad \forall p \in \mathcal{P},\ \forall t \in [T-1]. \tag{6}$$

Here, we recall , then from (2), we have:

$$\mathbb{E}_t\big[\hat{L}_t(p)\big] \le L_t(p) := \sum_{e \in p} \ell_t(e), \quad \forall p \in \mathcal{P}. \tag{7}$$

Under the condition that , we obtain:

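$$\begin{aligned}
\frac{W_{t+1}}{W_t} = \sum_{p \in \mathcal{P}} \frac{w_{t+1}(p)}{W_t}
&= \sum_{p \in \mathcal{P}} \frac{w_t(p) \cdot \exp\big(-\eta \hat{L}_t(p)\big)}{W_t} \\
&= \sum_{p \in \mathcal{P}} d_t(p) \cdot \exp\big(-\eta \hat{L}_t(p)\big) \\
&\le \sum_{p \in \mathcal{P}} \Big[ d_t(p) \Big( 1 - \eta \hat{L}_t(p) + \frac{\eta^2}{2} \big(\hat{L}_t(p)\big)^2 \Big) \Big] \\
&= 1 - \sum_{p \in \mathcal{P}} \Big[ d_t(p) \Big( \eta \hat{L}_t(p) - \frac{\eta^2}{2} \big(\hat{L}_t(p)\big)^2 \Big) \Big].
\end{aligned} \tag{8}$$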

Here, the second equality comes from (6) and the inequality comes from the fact that $e^{-x} \le 1 - x + x^2/2$ for $x \ge 0$. From (8) and the inequality $\ln(1-x) \le -x$ for any $x < 1$, we have the following inequality. (Footnote 10: We can easily check that for any .)

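$$\ln\left(\frac{W_{T+1}}{W_1}\right) = \sum_{t=1}^{T} \ln\left(\frac{W_{t+1}}{W_t}\right) \le \sum_{t=1}^{T} \left( -\eta \sum_{p \in \mathcal{P}} d_t(p) \hat{L}_t(p) + \frac{\eta^2}{2} \sum_{p \in \mathcal{P}} d_t(p) \big(\hat{L}_t(p)\big)^2 \right). \tag{9}$$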

On the other hand, let us fix a path , then

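$$\begin{aligned}
\ln\left(\frac{W_{T+1}}{W_1}\right) \ge \ln\left(\frac{w_{T+1}(p^*)}{W_1}\right)
&= \ln \frac{w_T(p^*) \exp\big(-\eta \hat{L}_T(p^*)\big)}{|\mathcal{P}|} \\
&= \ln \frac{w_{T-1}(p^*) \exp\big(-\eta \hat{L}_T(p^*) - \eta \hat{L}_{T-1}(p^*)\big)}{|\mathcal{P}|} \\
&= -\eta \sum_{t=1}^{T} \hat{L}_t(p^*) - \ln(|\mathcal{P}|).
\end{aligned} \tag{10}$$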

In the arguments leading to (10), we again use (6) and the fact that , including . Therefore, combining (9) and (10) then dividing both sides by , we have that

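$$\sum_{t=1}^{T} \sum_{p \in \mathcal{P}} d_t(p) \hat{L}_t(p) \le \frac{\ln(|\mathcal{P}|)}{\eta} + \sum_{t=1}^{T} \hat{L}_t(p^*) + \frac{\eta}{2} \sum_{t=1}^{T} \sum_{p \in \mathcal{P}} d_t(p) \big(\hat{L}_t(p)\big)^2. \tag{11}$$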
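Inequality (11) holds deterministically for any sequence of non-negative loss estimates, so it can be sanity-checked on a toy instance. The following sketch makes several simplifying assumptions that are not part of the paper: the path set is replaced by a flat list of `num_paths` abstract "paths", the estimates are drawn uniformly at random instead of being built from the observation model, and all weights start at 1. The function name `check_inequality_11` and its parameters are hypothetical.

```python
import math
import random


def check_inequality_11(num_paths=6, T=200, eta=0.1, seed=0):
    """Run the multiplicative update (6) on arbitrary non-negative loss
    estimates and return both sides of inequality (11)."""
    rng = random.Random(seed)
    w = [1.0] * num_paths            # initial weights, so W_1 = num_paths
    lhs = 0.0                        # sum_t sum_p d_t(p) * L_hat_t(p)
    second_order = 0.0               # sum_t sum_p d_t(p) * L_hat_t(p)^2
    cumulative = [0.0] * num_paths   # sum_t L_hat_t(p) for each p
    for _ in range(T):
        W = sum(w)
        d = [x / W for x in w]       # sampling distribution d_t
        L_hat = [rng.uniform(0.0, 5.0) for _ in range(num_paths)]
        lhs += sum(di * li for di, li in zip(d, L_hat))
        second_order += sum(di * li * li for di, li in zip(d, L_hat))
        cumulative = [c + li for c, li in zip(cumulative, L_hat)]
        w = [x * math.exp(-eta * li) for x, li in zip(w, L_hat)]  # update (6)
    # (11) holds for every fixed path p*, hence for the best one in hindsight.
    rhs = math.log(num_paths) / eta + min(cumulative) + 0.5 * eta * second_order
    return lhs, rhs


lhs_value, rhs_value = check_inequality_11()
```

Changing `eta` shifts the balance between the `ln(|P|) / eta` term and the second-order term on the right-hand side, which is the trade-off that the choice of the learning rate resolves.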

Now, we take the expectation of both sides of (11) with respect to the randomness in choosing , and then apply (7) to obtain:

 T∑t=1∑p∈Pdt(p)Et[^Lt(p)]≤ln(|P|)η+T∑t=1Lt(