1 Introduction
Resource allocation games have been studied extensively in the literature and have been shown to be very useful for modeling many practical situations, including online decision problems; see, e.g., blocki2013audit ; bower2005resource ; korzhyk2010complexity ; zhang2009multi . Two of the most renowned are the Colonel Blotto game (henceforth, CB game) and the Hide-and-Seek game (henceforth, HS game). In the (one-shot) CB game, two players, each with a fixed budget, simultaneously allocate their (indivisible) resources to battlefields; each player’s payoff is the aggregate of the values of the battlefields where she has the higher allocation. The applications of CB games span a variety of problems: security (e.g., chia2012 ; schwartz2014 ), where resources correspond to security forces; politics (e.g., kovenock2012 ; roberson2006 ), where a budget is allocated to attract voters; and advertising (e.g., masucci2014 ; masucci2015 ), where broadcasting time is distributed. In the (one-shot) HS game, on the other hand, a seeker chooses among locations to search for a hider, who randomly chooses to hide in one of the locations. The seeker’s payoff is the probability that she finds the hider, and the hider’s payoff is the probability that she successfully escapes the seeker’s pursuit. Several variants of the HS games are used to model surveillance situations
bhattacharya2014surveillance ; bhattacharya2009existence , anti-jamming problems in telecommunications navda2007using ; wang16 ; xu2005feasibility , vehicle control chung2011search ; vidal2002probabilistic , etc.
Both the CB games and the HS games have a long-standing history (originating in 1921 borel1921 and 1953 vonneumann53 , respectively); however, the results achieved so far in these games are mostly limited to their one-shot and full-information versions (see e.g., Behnezhad17a ; grosswagner ; roberson2006 ; schwartz2014 ; Vu18a for CB games and grote1975theory ; hespanha2000probabilistic ; yavin1987pursuit for HS games). In contrast, in most applications (e.g., web security, advertising, telecommunications), a more natural setting is one where the game is played repeatedly and players have access only to incomplete information at each stage. In this setting, players must learn the game sequentially, on the fly, and balance the trade-off between exploiting known information and exploring to gain new information. This work therefore focuses on the following sequential learning problem: at each stage, a learner plays a CB game (resp. HS game); at the end of the stage, she receives limited feedback, namely the gain she obtains from each battlefield (resp. the hider’s escape probability corresponding to the chosen locations); and her objective is to maximize her cumulative payoff. A formal definition of these problems is given in Section 4; hereinafter, we reuse the terms CB game and HS game to refer to this sequential learning version of the games. The main challenge in these games is that their strategy space is exponential in the natural parameters (e.g., number of troops and battlefields in the CB game, number of locations in the HS game); hence, how to learn efficiently in these games is an open question.
Our first contribution towards solving this open question is to show that the CB and HS games can be cast as Path Planning Problems (henceforth, PPP), one of the most well-studied instances of the Online Combinatorial Optimization framework (henceforth, OComb; see chen2013combinatorial for a survey). In PPPs, given a graph with edges, at each stage, a learner chooses a path; then a loss is adversarially chosen for each edge and the learner suffers the aggregate of the losses of the edges belonging to her chosen path. The learner’s goal is to minimize regret, that is, the difference between the learner’s cumulative loss and that of the best action in hindsight. The information that the learner receives in the CB and HS games as described above straightforwardly corresponds to the so-called semi-bandit feedback setting of PPPs, i.e., at the end of each stage, the learner observes the losses of the edges belonging to her chosen path. However, the specific structure of the considered games also allows the learner to deduce (without any extra cost) from the semi-bandit feedback the losses of some of the other edges that may not belong to the chosen path; these are called side-observations. Henceforth, we will use the term SOPPP to refer to this PPP under semi-bandit feedback with side-observations.
SOPPP is a special case of OComb with side-observations (henceforth, SOComb) studied by kocak14 and, following their approach, we will use observation graphs (defined in Section 2) to capture the learner’s observability. These observation graphs, proposed in kocak14 and used here for SOPPP, extend the side-observations model for multi-armed bandit problems studied by alon15 ; alon13 ; mannor11 : they capture side-observations between edges, whereas the model considered in alon15 ; alon13 ; mannor11 captures side-observations between actions (i.e., paths in PPPs). In kocak14 , the authors focus on the class of Follow-the-Perturbed-Leader (FPL) algorithms (originating from kalai2005efficient ) and propose an algorithm named FPL-IX for SOComb, which could be applied directly to SOPPP.
However, this faces two main problems: (i) the efficiency of FPL-IX is only guaranteed with high probability (as it depends on the geometric sampling technique) and (ii) it requires an efficient oracle that solves an optimization problem at each stage. Both are incompatible with our goal of learning in the CB and HS games.
In this paper, we focus instead on another prominent class of OComb algorithms, called Exp3 auer02b ; Freund1997 . Our second contribution is to propose an algorithm for SOPPP that solves both of the aforementioned issues and provides good regret guarantees. In more detail, this contribution is threefold:
(i) We propose a novel algorithm, Exp3-OE, that is applicable to any instance of SOPPP. Importantly, Exp3-OE is always guaranteed to run efficiently (i.e., in time polynomial in the number of edges of the graph in SOPPP) without the need for any auxiliary oracle.
(ii) We prove that Exp3-OE guarantees an upper bound on the expected regret matching in order the best benchmark in the literature (the FPL-IX algorithm). We also prove further improvements under additional assumptions on the observation graphs that have so far been ignored in the literature.
(iii) We demonstrate the benefit of using the Exp3-OE algorithm in the CB and HS games.
Our Exp3-OE algorithm is based on the Exp3-IX algorithm kocak14 . However, Exp3-IX has a very inefficient running time in SOComb (and particularly in SOPPP) and thus it is only analyzed by kocak14 in the trivial cases of SOComb involving only actions whose L1-norm equals 1 (corresponding to SOPPP with graphs where all paths have length 1); the existence of an efficient implementation of Exp3-type algorithms in SOComb is left as an open question in kocak14 . We address this question in the particular case of SOPPP as follows. We introduce two major changes in Exp3-OE. First, unlike Exp3-IX, which uses an adaptive implicit exploration scheme, we assume that the time horizon is known in advance (if it is unknown, the doubling trick can be used to obtain similar results; see auer1995gambling ; besson2018doubling ) and fix an implicit exploration parameter in the loss estimator of Exp3-OE. This change reduces the computations and leads to a different parameter tuning scheme with improved regret bounds compared to Exp3-IX. Second (and this is the main reason why Exp3-OE is significantly more efficient than Exp3-IX), we use a novel loss estimator that can be computed efficiently with a dynamic-programming technique called weight pushing. Note that while weight pushing has been used for efficiently sampling paths from exponentially updated weights in several variants of Exp3 (e.g., gyorgy2007 ; sakaue2018 ; takimoto2003 ), the way we apply it to compute the loss estimator is novel and non-trivial. Finally, note that the SOPPP model (and thus our proposed Exp3-OE algorithm) can be applied to many problems beyond the considered games, e.g., auctions and recommendation systems.
Throughout the paper, we use bold symbols to denote vectors, e.g.,
, and to denote the th element. For any , the set is denoted by and the indicator function of a set is denoted by . For graphs, we write either or to indicate that an edge belongs to a path . For the sake of conciseness, we first present our second contribution on the SOPPP in general, and we then return in Section 4 to our first contribution relating to the CB and HS games.
2 Path Planning Problems with Side-Observations (SOPPP) Formulation
As discussed in Section 1, motivated by the CB and HS games, we focus on the path planning problem with semi-bandit and side-observation feedback (SOPPP) and design an Exp3-type algorithm that always runs efficiently in SOPPP. To do this, we first formally define the SOPPP model as follows.
SOPPP model. Consider a directed acyclic graph (henceforth, DAG), denoted by , whose set of vertices and set of edges are respectively denoted by and . Let and ; there are two special vertices, a source and a destination, that are respectively called and . We denote by the set of all paths starting from and ending at . Each path corresponds to a vector in (thus, ) where if and only if edge belongs to . Let be the length of the longest path in , that is . Given a time horizon , at each (discrete) stage , a learner chooses a path . Then, a loss vector is secretly and adversarially chosen (obliviously, i.e., independently of the learner’s decisions). Each element corresponds to the scalar loss embedded on the edge . The learner’s incurred loss is , i.e., the sum of the losses from all the edges belonging to . The learner’s feedback at stage after choosing is as follows. First, she receives semi-bandit feedback, that is, she observes the edge losses , for any belonging to the chosen path . Additionally, each edge may reveal the losses on several other edges. To represent these side-observations at time , we consider a graph, denoted , containing vertices. Each vertex of corresponds to an edge of the graph . There exists a directed edge from a vertex to a vertex in if, by observing the edge loss , the learner can also deduce the edge loss (we also denote this by and say that the edge reveals the edge ). The objective of the learner is to minimize the cumulative expected regret, defined as .
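To make the model concrete, the following minimal sketch enumerates the source-to-destination paths of a small DAG, evaluates a path loss as the sum of its edge losses, and computes the regret of a sequence of choices against the best fixed path in hindsight. The graph, the loss values, and all names (`all_paths`, `path_loss`, `regret`) are illustrative assumptions, and the brute-force enumeration is only sensible for tiny graphs.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]
Path = Tuple[Edge, ...]


def all_paths(edges: List[Edge], source: str, dest: str) -> List[Path]:
    """Enumerate every source-to-dest path of a DAG as a tuple of edges."""
    succ: Dict[str, List[str]] = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)

    def walk(u: str) -> List[Path]:
        if u == dest:
            return [()]
        return [((u, v),) + rest for v in succ.get(u, []) for rest in walk(v)]

    return walk(source)


def path_loss(path: Path, loss: Dict[Edge, float]) -> float:
    """Loss of a path: the sum of the losses on its edges."""
    return sum(loss[e] for e in path)


def regret(chosen: List[Path], losses: List[Dict[Edge, float]], paths: List[Path]) -> float:
    """Cumulative loss of the chosen paths minus that of the best fixed path."""
    learner = sum(path_loss(p, l) for p, l in zip(chosen, losses))
    best = min(sum(path_loss(p, l) for l in losses) for p in paths)
    return learner - best


# Hypothetical toy instance with two stages.
edges = [("s", "a"), ("s", "b"), ("a", "d"), ("a", "b"), ("b", "d")]
paths = all_paths(edges, "s", "d")
losses = [
    {("s", "a"): 0.5, ("s", "b"): 0.0, ("a", "d"): 0.5, ("a", "b"): 0.0, ("b", "d"): 1.0},
    {("s", "a"): 0.0, ("s", "b"): 1.0, ("a", "d"): 0.0, ("a", "b"): 1.0, ("b", "d"): 0.0},
]
print(regret([paths[2], paths[2]], losses, paths))
```

The number of paths can grow exponentially with the size of the graph, which is why the algorithm discussed in Section 3 works with edge weights rather than with an explicit list of paths.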
Hereinafter, in places where there is no ambiguity, we use the term path to refer to a path in and the term observation graphs to refer to . In general, these observation graphs can depend on the decisions of both the learner and the adversary. However, all vertices in always have self-loops. In the case where none among contains any edge other than these self-loops, no side-observation is allowed and the problem reduces to the classical semi-bandit setting. If all are complete graphs, SOPPP corresponds to the full-information PPPs. In this work, we focus on the uninformed setting, i.e., the learner observes only after making a decision at time . Let us also introduce two new notations:
Intuitively, is the set of all paths that, if chosen, reveal the loss on the edge and is the set of all edges whose losses are revealed if the path is chosen. Trivially, . Moreover, due to the semi-bandit feedback, if , then and . Apart from the results for general observation graphs, in this work we additionally present several results under two particular assumptions, satisfied by some instances in practice (e.g., the CB and HS games), that provide more refined regret bounds compared to the cases considered in kocak14 :
(i) symmetric observation graphs, where for each edge from to there also exists an edge from to (i.e., if then ), so that is an undirected graph;
(ii) observation graphs under the following assumption, which requires that if two edges belong to a path in , then they cannot simultaneously reveal the loss of another edge.

For any , if and , then .
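The two notations introduced above (the edges revealed by a path, and the paths that reveal an edge) and the symmetry condition can be illustrated with a short sketch. Here an observation graph is stored as a dictionary mapping each edge to the set of edges it reveals, self-loop included; this representation, the edge labels, and the function names are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]


def revealed_edges(path: Path, reveals: Dict[str, Set[str]]) -> Set[str]:
    """Edges whose losses are revealed when `path` is chosen."""
    out: Set[str] = set()
    for e in path:
        out |= reveals[e]
    return out


def revealing_paths(edge: str, paths: List[Path], reveals: Dict[str, Set[str]]) -> List[Path]:
    """Paths that, if chosen, reveal the loss on `edge`."""
    return [p for p in paths if edge in revealed_edges(p, reveals)]


def is_symmetric(reveals: Dict[str, Set[str]]) -> bool:
    """True if e revealing f always implies f revealing e."""
    return all(e in reveals[f] for e, targets in reveals.items() for f in targets)


# Hypothetical instance: two edge-disjoint paths, and e1 also reveals e3.
paths = [("e1", "e2"), ("e3", "e4")]
reveals = {"e1": {"e1", "e3"}, "e2": {"e2"}, "e3": {"e3"}, "e4": {"e4"}}

print(revealed_edges(("e1", "e2"), reveals))
print(revealing_paths("e3", paths, reveals))
print(is_symmetric(reveals))
```

Because every vertex of an observation graph has a self-loop, a path always reveals its own edges, which is the semi-bandit part of the feedback.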
3 Exp3-OE: An Efficient Algorithm for the SOPPP
In this section, we present a new algorithm for SOPPP, called Exp3-OE (OE stands for Observable Edges), whose pseudo-code is given by Algorithm 1. The guarantees on the expected regret of Exp3-OE in SOPPP are analyzed in Section 3.2. More importantly, Exp3-OE always runs efficiently, in time polynomial in the number of edges of ; this is discussed in Section 3.1.
As an Exp3-type algorithm, Exp3-OE relies on the average weights sampling where at stage we update the weight on each edge by the exponential rule (line ). For each path , we denote the path weight and define the following normalized terms, according to which a path is sampled at each stage (see line ) of the Exp3-OE algorithm:
(1) 
Compared to other instances of the Exp3-type algorithms, Exp3-OE has two major differences. First, at each stage , the loss of each edge is estimated by (line ) based on the term and a parameter . Intuitively, is the probability that the loss on the edge is revealed from playing the chosen path at . On the other hand, the implicit exploration parameter added to the denominator allows us to “pretend to explore” in Exp3-OE without knowing the observation graph before making the decision at stage (the uninformed setting). Unlike the standard Exp3 algorithm, the loss estimator used in Exp3-OE is biased, that is
(2) 
Here, denotes the expectation w.r.t. the randomness of choosing a path at stage . Second, unlike standard Exp3 algorithms that keep track of and update the weight of each path, the weight pushing technique is applied at line (via Algorithm 4 in Appendix A) and line (via Algorithm 2 in Section 3.1), where we work with edge weights instead of path weights (recall that ).
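The following sketch illustrates one stage of an Exp3-type update with an implicit-exploration loss estimator, as described above. It is not the paper's Algorithm 1: it enumerates paths by brute force for readability, and the estimator form (observed loss divided by the reveal probability plus an exploration parameter, zero for unrevealed edges) is inferred from the description in the text. The names `eta` and `beta` for the learning rate and the exploration parameter are hypothetical.

```python
import math
import random
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]


def exp3_oe_round(
    paths: List[Path],
    weights: Dict[str, float],
    loss: Dict[str, float],
    reveals: Dict[str, Set[str]],
    eta: float,
    beta: float,
    rng: random.Random,
) -> Tuple[Path, Dict[str, float]]:
    """Sample a path, estimate edge losses, and update edge weights."""
    # Path weight = product of its edge weights; normalize to a distribution.
    path_w = [math.prod(weights[e] for e in p) for p in paths]
    total = sum(path_w)
    x = [w / total for w in path_w]
    chosen = rng.choices(paths, weights=x, k=1)[0]

    # Edges revealed by the chosen path (semi-bandit plus side-observations).
    observed = set().union(*(reveals[e] for e in chosen))

    new_weights = {}
    for e in weights:
        # Probability that e is revealed: mass of paths with an edge revealing e.
        q = sum(xp for p, xp in zip(paths, x) if any(e in reveals[f] for f in p))
        estimate = loss[e] / (q + beta) if e in observed else 0.0
        new_weights[e] = weights[e] * math.exp(-eta * estimate)
    return chosen, new_weights


# Hypothetical instance: two disjoint paths, no side-observations.
paths = [("a", "b"), ("c", "d")]
weights = {e: 1.0 for e in "abcd"}
loss = {e: 1.0 for e in "abcd"}
reveals = {e: {e} for e in "abcd"}
chosen, new_weights = exp3_oe_round(paths, weights, loss, reveals, 0.1, 0.1, random.Random(0))
print(chosen, new_weights)
```

Adding `beta` to the denominator makes the estimator slightly underestimate the true loss in expectation, which is the bias mentioned above.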
3.1 Running Time Efficiency of the Exp3-OE Algorithm
We recall that in order to efficiently sample a path according to , following the literature, it is useful to compute the terms and for any vertex in . Intuitively, is the aggregate weight of all paths from vertex to vertex at stage . Then, a path in is sampled sequentially, edge by edge, based on these terms . The collection of the computations described above is often referred to as weight pushing, which can be done in by exploiting the structure of the graph. We rewrite this step formally in Appendix A.
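A minimal sketch of weight pushing is given below, under the assumption that the DAG is supplied with a topological order of its vertices. It computes, for each vertex, the aggregate weight of all paths from that vertex to the destination, and then samples a path edge by edge so that each path is drawn with probability proportional to its weight. The function names and the toy graph are illustrative and do not reproduce the paper's Algorithm 4.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def suffix_weights(
    topo_order: List[str],
    succ: Dict[str, List[str]],
    weights: Dict[Edge, float],
    dest: str,
) -> Dict[str, float]:
    """H[u] = total weight of all paths from u to dest (one backward pass)."""
    H = {dest: 1.0}
    for u in reversed(topo_order):
        if u != dest:
            H[u] = sum(weights[(u, v)] * H[v] for v in succ.get(u, []))
    return H


def sample_path(
    source: str,
    dest: str,
    succ: Dict[str, List[str]],
    weights: Dict[Edge, float],
    H: Dict[str, float],
    rng: random.Random,
) -> Tuple[Edge, ...]:
    """Draw a path with probability (path weight) / H[source]."""
    path, u = [], source
    while u != dest:
        candidates = succ[u]
        scores = [weights[(u, v)] * H[v] for v in candidates]
        v = rng.choices(candidates, weights=scores, k=1)[0]
        path.append((u, v))
        u = v
    return tuple(path)


# Hypothetical toy graph with three source-to-destination paths.
topo_order = ["s", "a", "b", "d"]
succ = {"s": ["a", "b"], "a": ["d", "b"], "b": ["d"]}
weights = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "d"): 3.0, ("a", "b"): 1.0, ("b", "d"): 0.5}
H = suffix_weights(topo_order, succ, weights, "d")
print(H["s"], sample_path("s", "d", succ, weights, H, random.Random(1)))
```

Each edge is visited once in the backward pass, so the cost is linear in the number of edges even though the number of paths may be exponential.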
The final non-trivial step to efficiently implement Exp3-OE is to compute , the probability that an edge is revealed at stage , needed in line . We note that is the sum of terms; therefore, a direct computation is inefficient, while a naive application of the weight pushing technique can easily lead to errors. To compute , we propose Algorithm 2, a non-straightforward application of weight pushing, in which we consecutively consider all the edges . Then, we take the sum of the terms of the paths going through by the weight pushing technique while making sure that each of these terms is included only once, even if has more than one edge revealing (this is a non-trivial step). In Algorithm 2, we denote by the set of the direct successors of any vertex . A proof that Algorithm 2 outputs exactly as defined in line of Algorithm 1 can be found in Appendix B. Algorithm 2 runs in time; therefore, line of Algorithm 1 can be done in at most time. In conclusion, the Exp3-OE algorithm runs in at most time, and this guarantee holds even in the worst case. For comparison, the running time of FPL-IX proposed by kocak14 is in expectation if we choose Dijkstra’s algorithm to be the optimization oracle at each stage. On the other hand, with the parameters chosen in kocak14 , we can deduce that FPL-IX achieves a running time in (a version of the big-O asymptotic notation that ignores the logarithmic terms) with probability at least for an arbitrary . That is, FPL-IX is not guaranteed to have an efficient running time in all cases.
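The double-counting issue described above can be sidestepped in a simple way, sketched below. This is explicitly not the paper's Algorithm 2 but an alternative based on the complement: the probability that an edge is revealed equals one minus the normalized weight of the paths that contain none of the edges revealing it, and that weight is obtained by one weight-pushing pass with those edges removed. All names and the toy instance are assumptions; the subtraction may also lose precision when the reveal probability is very small.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]


def total_path_weight(
    topo_order: List[str],
    succ: Dict[str, List[str]],
    weights: Dict[Edge, float],
    source: str,
    dest: str,
    banned: FrozenSet[Edge] = frozenset(),
) -> float:
    """Total weight of source-to-dest paths that avoid every banned edge."""
    H = {v: 0.0 for v in topo_order}
    H[dest] = 1.0
    for u in reversed(topo_order):
        if u == dest:
            continue
        H[u] = sum(
            weights[(u, v)] * H[v] for v in succ.get(u, []) if (u, v) not in banned
        )
    return H[source]


def reveal_probabilities(
    topo_order: List[str],
    succ: Dict[str, List[str]],
    weights: Dict[Edge, float],
    source: str,
    dest: str,
    reveals: Dict[Edge, Set[Edge]],
) -> Dict[Edge, float]:
    """For each edge, the probability that the sampled path reveals it."""
    total = total_path_weight(topo_order, succ, weights, source, dest)
    revealers: Dict[Edge, Set[Edge]] = {e: set() for e in weights}
    for f, targets in reveals.items():
        for e in targets:
            revealers[e].add(f)
    return {
        e: 1.0
        - total_path_weight(topo_order, succ, weights, source, dest, frozenset(revealers[e]))
        / total
        for e in weights
    }


# Hypothetical toy instance: edge (s, a) also reveals edge (b, d).
topo_order = ["s", "a", "b", "d"]
succ = {"s": ["a", "b"], "a": ["d", "b"], "b": ["d"]}
weights = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "d"): 3.0, ("a", "b"): 1.0, ("b", "d"): 0.5}
reveals = {e: {e} for e in weights}
reveals[("s", "a")].add(("b", "d"))
q = reveal_probabilities(topo_order, succ, weights, "s", "d", reveals)
print(q)
```

Each path that avoids all revealing edges is counted exactly once in the complement, so no path is counted twice even when it contains several edges revealing the same edge. This sketch uses one pass per edge of the graph.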
3.2 Performance of the Exp3-OE Algorithm
In this section, we present an upper bound on the expected regret achieved by the Exp3-OE algorithm in the SOPPP. For the sake of brevity, with defined in (1), for any and , we denote:
(3) 
Intuitively, is the probability that the chosen path at stage contains an edge and is the summation over all the edges of the ratio of this quantity and the probability that the loss of an edge is revealed (plus ). We can bound the expected regret with this key term .
Theorem 3.1.
The expected regret of the Exp3-OE algorithm in the SOPPP satisfies:
(4) 
The proof of Theorem 3.1 is given in Appendix C and follows an approach similar to alon13 ; cesa2012 with several necessary adjustments to handle the new biased loss estimator in Exp3-OE. To see the relationship between the structure of the side-observations of the learner and the bound on the expected regret, we look for upper bounds of in terms of the observation graphs’ parameters. Let be the independence number of (computed while ignoring the direction of the edges); we have the following statement.
Theorem 3.2.
Let us denote , and . Upper bounds of in different cases of are given in the following table:
 | satisfies | does not satisfy |
Symmetric | | |
Non-symmetric | | |
A proof of this theorem is given in Appendix E. The main idea of this proof is based on several graph-theoretical lemmas that are extracted from alon13 ; kocak14 ; mannor11 . These lemmas establish the relationship between the independence number of a graph and the ratios of the weights on the graph’s vertices that have forms similar to the key term . The case where observation graphs are non-symmetric and do not satisfy assumption is the most general setting. Moreover, as shown in Theorem 3.2, the bounds of are improved if the observation graphs satisfy either the symmetry condition or assumption . Intuitively, given the same independence numbers, a symmetric observation graph gives the learner more information than a non-symmetric one; thus, it may yield a better bound on and the expected regret. On the other hand, assumption is a technical assumption that allows the use of different techniques in the proofs to obtain better bounds. These cases have not been analyzed in the literature, although they are satisfied by several practical situations, including the CB and HS games (see Section 4).
Finally, we give results on the order of the upper bounds of the expected regret obtained by the Exp3-OE algorithm, presented as a corollary of Theorems 3.1 and 3.2.
Corollary 3.3.
In SOPPP, let be an upper bound of . With appropriate choices of the parameters and , the expected regret of the Exp3-OE algorithm is:

in the general case.

if assumption is satisfied by the observation graphs .
The choices of the parameters and (which are non-trivial in the cases where the observation graphs are non-symmetric) that yield these results are given in Appendix F. We also note that a trivial upper bound of is the number of vertices of the graph , which is (the number of edges in ). In general, the more connected is, the smaller may be chosen, and thus the better the upper bound of the expected regret. In the (classical) semi-bandit setting, and in the full-information setting, . Finally, we also note that, if (this is typical in practice, including the CB and HS games), the bound in Corollary 3.3 matches in order the bounds (ignoring the logarithmic factors) given by the FPL-IX algorithm (see kocak14 ). On the other hand, the form of the regret bound provided by the Exp3-IX algorithm (see kocak14 ) does not allow us to compare directly with the bound of Exp3-OE in the general SOPPP. In kocak14 , Exp3-IX is only analyzed when , i.e., ; in this case, we observe that the bound given by our Exp3-OE algorithm is better than that of Exp3-IX (by some multiplicative constants).
4 Colonel Blotto Games and Hide-and-Seek Games as SOPPP
Given the regret analysis of Exp3-OE in SOPPP, we now return to our main motivation, the Colonel Blotto and the Hide-and-Seek games, and discuss how to apply our findings to these games. To this end, we formally define the online version of the games and show how these problems can be formulated as SOPPP in Sections 4.1 and 4.2; we then demonstrate the benefit of using the Exp3-OE algorithm for learning in these games (Section 4.3).
4.1 Colonel Blotto Games as an SOPPP
The online Colonel Blotto game. This is a game between a learner and an adversary over battlefields within a time horizon . Each battlefield has a value (unknown to the learner) at stage such that . At stage , the learner needs to distribute troops ( is fixed) among the battlefields while the adversary simultaneously allocates hers. The learner’s strategy set is . At stage and battlefield , if the adversary’s allocation is strictly larger than the learner’s allocation, the learner loses this battlefield and she suffers the loss ; if their allocations are tied, she suffers the loss ; otherwise, she wins and suffers no loss. At the end of stage , the learner observes the loss from each battlefield (and which battlefields she wins, ties, or loses) but not the adversary’s allocations. The learner’s loss at each time is the sum of the losses from all the battlefields. The objective of the learner is then to minimize her loss over a finite period of time.
While this problem can be formulated as a standard OComb, it is difficult to derive an efficient learning algorithm under that formulation, due to the learner’s exponentially large set of strategies to choose from at each stage. Instead, we show that by reformulating the problem as an SOPPP, we are able to exploit the advantages of the Exp3-OE algorithm to solve it. To do so, first note that the learner can deduce several side-observations as follows:
(i) if she allocates troops to battlefield and wins, she knows that if she had allocated more than troops to , she would also have won;
(ii) if she knows the allocations are tied at battlefield , she knows exactly the adversary’s allocation to this battlefield and can deduce all the losses she would have suffered had she allocated differently to battlefield ;
(iii) if she allocates troops to battlefield and loses, she knows that if she had allocated fewer than troops to battlefield , she would also have lost.
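The three deduction rules above can be written as a small function. The sketch below works with outcomes (win, tie, lose) rather than loss values, and the function name, the string labels, and the parameter `n_troops` for the learner's total number of troops are assumptions made for illustration.

```python
from typing import Dict


def deduce_outcomes(allocated: int, outcome: str, n_troops: int) -> Dict[int, str]:
    """Outcomes the learner can infer for alternative allocations to one battlefield.

    `allocated` is the number of troops actually sent, `outcome` is one of
    "win", "tie", "lose", and alternative allocations range over 0..n_troops.
    """
    known = {allocated: outcome}
    for a in range(n_troops + 1):
        if outcome == "win" and a > allocated:
            known[a] = "win"  # more troops would also have won
        elif outcome == "lose" and a < allocated:
            known[a] = "lose"  # fewer troops would also have lost
        elif outcome == "tie" and a != allocated:
            # A tie reveals the adversary's allocation exactly.
            known[a] = "win" if a > allocated else "lose"
    return known


print(deduce_outcomes(2, "win", 4))
print(deduce_outcomes(2, "tie", 4))
print(deduce_outcomes(2, "lose", 4))
```

A tie is the most informative outcome, since it determines the result of every alternative allocation to that battlefield.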
Now, to cast the CB game as SOPPP, for each instance of the parameters and , we create a DAG such that the strategy set has a one-to-one correspondence to the path set of . The formal definition of is given in Appendix G; due to the lack of space, we only present here an example illustrating the graph of an instance of the CB game in Figure 3(a). The graph has edges and paths, while the length of every path is . Each edge in corresponds to allocating a certain amount of troops to a battlefield. Therefore, the CB game model is equivalent to a PPP where at each stage the learner chooses a path in and the loss on each edge is generated from the allocations of the adversary and the learner (corresponding to that edge) according to the rules of the game. At stage , the (semi-bandit) feedback and the side-observations deduced by the learner as described above induce an observation graph . For example, in Figure 3(a), if the learner chooses a path going through edge (corresponding to allocating troop to battlefield ) and wins (thus, the loss at edge is ), then she deduces that the losses on the edges , and (corresponding to allocating at least troop to battlefield ) are all . This formulation indeed transforms any CB game into an SOPPP.
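As an illustration of such a correspondence, the sketch below builds one natural layered DAG for the CB game and checks that its paths are in bijection with the learner's allocations. This construction is an assumption consistent with the description above, not a reproduction of the formal definition in Appendix G: a vertex (i, j) means that j troops have been allocated to the first i battlefields, and an edge from (i, j) to (i+1, j') means that j' - j troops go to battlefield i+1. The names `n_troops` and `n_fields` are hypothetical.

```python
from math import comb
from typing import Dict, List, Tuple

Vertex = Tuple[int, int]
Edge = Tuple[Vertex, Vertex]


def blotto_dag(n_troops: int, n_fields: int) -> List[Edge]:
    """Layered DAG whose source-to-destination paths are the allocations."""
    edges = []
    for i in range(n_fields):
        starts = [0] if i == 0 else range(n_troops + 1)
        for j in starts:
            # The last battlefield must absorb all remaining troops.
            ends = [n_troops] if i == n_fields - 1 else range(j, n_troops + 1)
            for j2 in ends:
                edges.append(((i, j), (i + 1, j2)))
    return edges


def count_paths(edges: List[Edge], source: Vertex, dest: Vertex) -> int:
    """Number of source-to-dest paths, by memoized recursion over successors."""
    succ: Dict[Vertex, List[Vertex]] = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    memo = {dest: 1}

    def count(u: Vertex) -> int:
        if u not in memo:
            memo[u] = sum(count(v) for v in succ.get(u, []))
        return memo[u]

    return count(source)


edges = blotto_dag(3, 3)
n_paths = count_paths(edges, (0, 0), (3, 3))
# Allocations of 3 indivisible troops over 3 battlefields: C(3 + 3 - 1, 3 - 1).
print(len(edges), n_paths, comb(5, 2))
```

In this sketch every path has exactly one edge per battlefield, and the number of edges grows polynomially while the number of paths grows combinatorially.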
Note that since there are edges in that refer to the same allocation (e.g., the edges , and in all refer to allocating troops to battlefield ), in the observation graphs, the vertices corresponding to these edges are always connected. Therefore, an upper bound of the independence number of in the CB game is . Moreover, we can verify that the observation graph of the CB game satisfies assumption for any and that it is non-symmetric.
4.2 Hide-and-Seek Games as an SOPPP
The online Hide-and-Seek game. This is a repeated game (within the time horizon ) between a hider and a seeker. In this work, we consider that the learner plays the role of the seeker and the hider is the adversary. There are locations, indexed from to . At each stage , the learner sequentially chooses locations, called a search, to seek the hider, that is, she chooses an (if , we say that location is her th move). The hider maliciously assigns losses to all locations (intuitively, these losses can be the time wasted supervising a mismatched location or the probability that the hider does not hide there, etc.). In this work, we consider the following condition on how the hider/adversary assigns the losses to the locations.

At stage , the adversary secretly assigns a loss to each location (unknown to the learner). These losses are fixed throughout the search of the learner.
The learner’s loss at stage is the sum of the losses from her chosen locations in the search at stage , that is . Moreover, in practice the search of the learner often needs to satisfy some constraints. In this work, as an example, we use the following constraint: for a fixed (called the coherence constraint), i.e., the seeker cannot search too far away from her previously chosen location. Our results can also be applied to HS games with other constraints, such as , i.e., she can only search forward; or , i.e., she cannot search a location more than times, etc. At the end of stage , the learner only observes the losses from the locations she chose in her search, and her objective is to minimize her total loss over .
Similar to the case of the CB game, tackling the HS game as a standard OComb is computationally involved. As such, we follow the SOPPP formulation instead. In particular, knowing that the adversary follows condition , the learner can deduce the following side-observations: within a stage, the loss at each location remains the same no matter when it is chosen in the search; that is, knowing the loss of choosing location as her th move, the learner knows the loss she would incur by choosing location as her th move for any . Given this, we create a DAG whose path set has a one-to-one correspondence to the set containing all feasible searches of the learner in the HS game with locations under the coherence constraint. A formal definition of is given in Appendix G. The HS game is equivalent to the PPP where the learner chooses a path in and the edge losses are generated by the adversary at each stage (note that to ensure all paths end at , there are auxiliary edges in that are always embedded with losses). Figure 3(b) illustrates the corresponding graph of an instance of the HS game. We note that there are edges and paths in .
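The sketch below counts the feasible searches by dynamic programming and lists the (move, location) pairs whose losses become known after a search when losses are fixed within a stage. The exact form of the coherence constraint is not reproduced in the text above, so the code assumes that consecutive locations may differ by at most `kappa`; this, together with the names `n_locations`, `k_moves`, and `kappa`, is a hypothetical choice for illustration.

```python
from typing import Set, Tuple


def count_searches(n_locations: int, k_moves: int, kappa: int) -> int:
    """Number of length-k searches where consecutive locations differ by <= kappa."""
    # ways[i] = number of feasible partial searches ending at location i.
    ways = [1] * n_locations
    for _ in range(k_moves - 1):
        ways = [
            sum(ways[j] for j in range(n_locations) if abs(i - j) <= kappa)
            for i in range(n_locations)
        ]
    return sum(ways)


def revealed_moves(search: Tuple[int, ...], k_moves: int) -> Set[Tuple[int, int]]:
    """(move, location) pairs whose loss is known after playing `search`.

    With losses fixed during a stage, observing a location at any move
    reveals its loss for every move index.
    """
    return {(move, loc) for loc in set(search) for move in range(1, k_moves + 1)}


print(count_searches(3, 2, 1))
print(sorted(revealed_moves((1, 2), 2)))
```

The counting pass mirrors a layered graph with one layer per move, which is why the number of searches can be large while the graph itself stays small.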
The semi-bandit feedback and side-observations as described above generate an observation graph at time (e.g., in Figure 3(b), the edges , and represent that location is chosen; thus, they mutually reveal each other). The independence number of is for any . We note that the observation graphs of the HS game are symmetric and do not satisfy assumption . Finally, we consider a relaxation of condition :

At stage , the adversary assigns a loss to each location . For , after the learner chooses, say, location as her th move, the adversary can observe that and change the losses for any location that has not been searched before by the learner, i.e., she can change the losses . An interpretation is that by searching a location, the learner/seeker “discovers and secures” that location; therefore, the adversary/hider cannot change her assigned loss at that place.
By replacing condition with condition , we limit the side-observations of the learner: she can only deduce that if , the edges in representing choosing a location as the move reveal the edges representing choosing that same location as the th move, but not vice versa. In this case, the observation graph only contains directed edges; however, its independence number is still , as in the HS games with condition .
4.3 Performance of Exp3-OE in the Colonel Blotto and Hide-and-Seek Games
Having formulated the CB game and the HS game as SOPPPs, we can use the Exp3-OE algorithm to achieve the following results (deduced directly from Corollary 3.3).
Corollary 4.1.
The expected regret of the Exp3-OE algorithm satisfies:

in the CB games with troops and battlefields.

in the HS games with locations and search.
At a high level, given the same scale on their inputs, the independence numbers of the observation graphs in HS games are smaller than in CB games (by a multiplicative factor of ). However, since assumption is satisfied by the observation graphs of the CB games and not by those of the HS games, the expected regret bounds of the Exp3-OE algorithm in these games have the same order of magnitude. From Corollary 4.1, we note that in the CB games, the order of the regret bounds given by Exp3-OE is better than that of the FPL-IX algorithm (thanks to the fact that is satisfied). On the other hand, in the HS games with condition involving symmetric observation graphs, the regret bound of the Exp3-OE algorithm improves on the bound of FPL-IX, but the two are still of the same order in the games’ parameters (ignoring the logarithmic factors). Finally, we compare the regret guarantees given by our Exp3-OE algorithm and by the Online Stochastic Mirror Descent algorithm (henceforth, OSMD; see AudibertBL2014 ), the benchmark algorithm for OComb with semi-bandit feedback (although OSMD does not run efficiently in general). Applying OSMD to the CB and HS games (as SOPPP), the side-observations are ignored and the expected regret bound guaranteed by OSMD is in . Using the parameters and chosen for Corollaries 3.3 and 4.1 (see Appendix F) in the corresponding cases of the observation graphs, the Exp3-OE algorithm provides a better upper bound of the expected regret than OSMD in the CB games if ; in the HS games with condition if ; and in the HS games with condition if . A proof of this statement is given in Appendix H.
5 Conclusion
In this work, we introduce the Exp3-OE algorithm for the path planning problem with semi-bandit feedback and side-observations. Exp3-OE is always efficiently implementable. Moreover, its regret guarantees match in order those of the FPL-IX algorithm. We apply our findings to derive the first solutions to the online version of the Colonel Blotto and Hide-and-Seek games. This work also extends the scope of application of the PPP model in practice, even for large instances.
References
 (1) Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In JMLR Workshop and Conference Proceedings, volume 40. Microtome Publishing, 2015.
 (2) Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618, 2013.
 (3) Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
 (4) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In FOCS, page 322. IEEE, 1995.
 (5) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
 (6) Soheil Behnezhad, Sina Dehghani, Mahsa Derakhshan, MohammadTaghi HajiAghayi, and Saeed Seddighin. Faster and simpler algorithm for optimal strategies of Blotto game. In AAAI, pages 369–375, 2017.
 (7) Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.

 (8) Sourabh Bhattacharya, Tamer Başar, and Maurizio Falcone. Surveillance for security as a pursuit-evasion game. In International Conference on Decision and Game Theory for Security, pages 370–379. Springer, 2014.
 (9) Sourabh Bhattacharya and Seth Hutchinson. On the existence of Nash equilibrium for a two player pursuit-evasion game with visibility constraints. In Algorithmic Foundation of Robotics VIII, pages 251–265. Springer, 2009.
 (10) Jeremiah Blocki, Nicolas Christin, Anupam Datta, Ariel D Procaccia, and Arunesh Sinha. Audit games. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
 (11) Emile Borel. La théorie du jeu et les équations intégrales à noyau symétrique. Comptes rendus de l’Académie des Sciences, 173(1304–1308):58, 1921.
 (12) Joseph L Bower and Clark G Gilbert. From resource allocation to strategy. Oxford University Press, 2005.
 (13) Nicolo CesaBianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

(14)
Wei Chen, Yajun Wang, and Yang Yuan.
Combinatorial multiarmed bandit: General framework and applications.
In
International Conference on Machine Learning
, pages 151–159, 2013.  (15) Pern Hui Chia. Colonel Blotto in web security. In The Eleventh Workshop on Economics and Information Security, WEIS Rump Session, pages 141–150, 2012.
 (16) Timothy H Chung, Geoffrey A Hollinger, and Volkan Isler. Search and pursuitevasion in mobile robotics. Autonomous robots, 31(4):299, 2011.
 (17) Yoav Freund and Robert E Schapire. A decisiontheoretic generalization of online learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
 (18) Oliver Gross and Robert Wagner. A continuous Colonel Blotto game. U.S.Air Force Project RAND Research Memorandum, 1950.
 (19) JD Grote. The theory and application of differential games. Springer, 1975.
 (20) András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The online shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007.
 (21) Joao P Hespanha, Maria Prandini, and Shankar Sastry. Probabilistic pursuitevasion games: A onestep nash approach. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No. 00CH37187), volume 3, pages 2272–2277. IEEE, 2000.
 (22) Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
 (23) Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.
 (24) Dmytro Korzhyk, Vincent Conitzer, and Ronald Parr. Complexity of computing optimal stackelberg strategies in security resource allocation games. In TwentyFourth AAAI Conference on Artificial Intelligence, 2010.
 (25) Dan Kovenock and Brian Roberson. Coalitional Colonel Blotto games with application to the economics of alliances. Journal of Public Economic Theory, 14(4):653–676, 2012.
 (26) Shie Mannor and Ohad Shamir. From bandits to experts: On the value of sideobservations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
 (27) Antonia Maria Masucci and Alonso Silva. Strategic resource allocation for competitive influence in social networks. In Allerton, pages 951–958, 2014.
 (28) Antonia Maria Masucci and Alonso Silva. Defensive resource allocation in social networks. In CDC, pages 2927–2932, 2015.
 (29) Vishnu Navda, Aniruddha Bohra, Samrat Ganguly, and Dan Rubenstein. Using channel hopping to increase 802.11 resilience to jamming attacks. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications. IEEE, pages 2526–2530. IEEE, 2007.
 (30) Brian Roberson. The Colonel Blotto game. Economic Theory, 29(1):2–24, 2006.
 (31) Shinsaku Sakaue, Masakazu Ishihata, and Shinichi Minato. Efficient bandit combinatorial optimization algorithm with zerosuppressed binary decision diagrams. In International Conference on Artificial Intelligence and Statistics, pages 585–594, 2018.
 (32) Galina Schwartz, Patrick Loiseau, and Shankar S Sastry. The heterogeneous Colonel Blotto game. In NetGCoop, pages 232–238, 2014.
 (33) Eiji Takimoto and Manfred K Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4(Oct):773–818, 2003.
 (34) Rene Vidal, Omid Shakernia, H Jin Kim, David Hyunchul Shim, and Shankar Sastry. Probabilistic pursuitevasion games: theory, implementation, and experimental evaluation. IEEE transactions on robotics and automation, 18(5):662–669, 2002.
 (35) John Von Neumann. A certain zerosum twoperson game equivalent to the optimal assignment problem. Contributions to the Theory of Games, 2:5–12, 1953.
 (36) Dong Quan Vu, Patrick Loiseau, and Alonso Silva. Efficient computation of approximate equilibria in discrete Colonel Blotto games. In IJCAIECAI, July 2018.
 (37) Qingsi Wang and Mingyan Liu. Learning in hideandseek. IEEE/ACM Transactions on Networking, 24(2):1279–1292, 2016.
 (38) Wenyuan Xu, Wade Trappe, Yanyong Zhang, and Timothy Wood. The feasibility of launching and detecting jamming attacks in wireless networks. In Proceedings of the 6th ACM international symposium on Mobile ad hoc networking and computing, pages 46–57. ACM, 2005.
 (39) Yaakov Yavin. Pursuit–evasion differential games with deception or interrupted observation. In PursuitEvasion Differential Games, pages 191–203. Elsevier, 1987.
 (40) Chongjie Zhang, Victor Lesser, and Prashant Shenoy. A multiagent learning approach to online distributed resource allocation. In TwentyFirst International Joint Conference on Artificial Intelligence, 2009.
Appendix A Weight Pushing for Path Sampling
We revisit some useful results in the literature. In this section, we consider a DAG with parameters as introduced in Section 2. For simplicity, we assume that each edge in belongs to at least one path in . Let us respectively denote by and the set of the direct successors and the set of the direct predecessors of any vertex . Moreover, let and respectively denote the edge and the set of all paths from vertex to vertex . Let us consider a weight for each edge . The Exp3OE algorithm needs to sample a path with the probability:
(5) 
A direct computation and sampling from takes time, which is very inefficient. To sample the path efficiently, we first label the vertex set by such that if there exists an edge connecting to then . We then define the following terms for each vertex :
Intuitively, is the aggregate weight of all paths from vertex to vertex and is exactly the denominator in (5). These terms and can be computed recursively through dynamic programming by Algorithm 3, which runs in time. This technique is called weight pushing and can be found in [20, 31, 33].
Based on Algorithm 3, we construct Algorithm 4, which takes the weights as inputs and randomly outputs a path in . Intuitively, starting from the root vertex , Algorithm 4 samples the path vertex by vertex based on the terms computed by Algorithm 3. It is noteworthy that Algorithm 4 also runs in time, and it is straightforward to prove that the probability that a path is sampled by Algorithm 4 matches exactly .
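The weight-pushing idea described above can be illustrated with a minimal Python sketch. It is not a transcription of Algorithms 3 and 4: the function names (`push_weights`, `sample_path`, `path_probability`), the dictionary representation of the edge weights, and the convention that vertices are labeled 0, ..., n-1 with vertex 0 as the source and vertex n-1 as the sink are all assumptions made for illustration. The sketch assumes that every edge goes from a smaller label to a larger one, that the weights are positive, and that at least one source-to-sink path exists.

```python
import random
from collections import defaultdict


def push_weights(n, weights):
    """Backward dynamic program (hypothetical helper).

    weights maps an edge (u, v), with u < v, to a positive weight.
    Returns (H, succ), where H[u] is the total weight of all paths
    from u to the sink n - 1 (a path's weight is the product of its
    edge weights) and succ[u] lists the (successor, weight) pairs of u.
    """
    succ = defaultdict(list)
    for (u, v), w in weights.items():
        succ[u].append((v, w))
    H = [0.0] * n
    H[n - 1] = 1.0  # the empty path from the sink to itself
    for u in range(n - 2, -1, -1):  # reverse topological order
        H[u] = sum(w * H[v] for v, w in succ[u])
    return H, succ


def sample_path(n, weights, rng=random):
    """Sample a source-to-sink path with probability proportional to
    the product of its edge weights, one vertex at a time."""
    H, succ = push_weights(n, weights)
    path, u = [0], 0
    while u != n - 1:
        candidates = [v for v, _ in succ[u]]
        # Moving from u to v has probability w(u, v) * H[v] / H[u].
        scores = [w * H[v] for v, w in succ[u]]
        u = rng.choices(candidates, weights=scores, k=1)[0]
        path.append(u)
    return path


def path_probability(path, n, weights):
    """Probability that sample_path returns the given vertex sequence."""
    H, _ = push_weights(n, weights)
    prod = 1.0
    for u, v in zip(path, path[1:]):
        prod *= weights[(u, v)]
    return prod / H[0]
```

Multiplying the transition probabilities along a path makes the intermediate aggregate terms cancel, which leaves the product of the edge weights divided by the aggregate weight at the source. Under the assumptions above, this is why sampling vertex by vertex reproduces the target distribution without enumerating the paths.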
Appendix B Proof of Algorithm 2’s Output
Proof.
Fixing an edge , we prove that when Algorithm 2 takes the edge weights as the input, it outputs exactly . We note that if , then .
We denote and label the edges in the set by . We let the for-loop in lines – of Algorithm 2 run consecutively over the edges in as follows:

After the for-loop runs for , we have ; therefore, since computed from the original weights . Due to line that sets , henceforth in Algorithm 2, the weight of any path that contains is set to .

When the for-loop runs for , we have because any path has the weight . Therefore, .

Similarly, after the for-loop runs for (where ), we have:

Therefore, after the for-loop finishes running for every edge in , we have where each term was only counted once even if contains more than one edge that reveals the edge .
∎
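The counting argument in this proof can be checked numerically with a short Python sketch. It follows the loop as the proof describes it (add the aggregate weight of the paths through the current revealing edge, then set that edge's weight to zero so no path is counted twice), but it is an illustrative reconstruction and not the paper's Algorithm 2. The names `forward_backward` and `observation_probability`, the vertex labeling 0, ..., n-1 with source 0 and sink n-1, and the recomputation of both aggregates at every iteration are assumptions made here for clarity rather than efficiency.

```python
def forward_backward(n, weights):
    """Hypothetical helper: N[u] is the total weight of paths from the
    source 0 to u, and H[u] the total weight of paths from u to the
    sink n - 1. Every edge (u, v) is assumed to satisfy u < v."""
    N = [0.0] * n
    H = [0.0] * n
    N[0] = 1.0
    H[n - 1] = 1.0
    edges = sorted(weights)  # ascending tail label = topological order
    for (u, v) in edges:
        N[v] += N[u] * weights[(u, v)]
    for (u, v) in reversed(edges):
        H[u] += weights[(u, v)] * H[v]
    return N, H


def observation_probability(n, weights, revealing_edges):
    """Probability that a path sampled proportionally to its weight
    contains at least one edge of revealing_edges."""
    w = dict(weights)  # work on a copy; the caller's weights stay intact
    _, H = forward_backward(n, w)
    total = H[0]  # aggregate weight of all paths, original weights
    acc = 0.0
    for (u, v) in revealing_edges:
        N, H = forward_backward(n, w)
        # Weight of all not-yet-counted paths that pass through (u, v).
        acc += N[u] * w[(u, v)] * H[v]
        # Zero the edge so these paths are never counted again.
        w[(u, v)] = 0.0
    return acc / total
```

On a DAG with edges (0,1), (0,2), (1,3), (2,3), (1,2) and respective weights 1, 2, 3, 1, 1, the paths 0-1-3, 0-2-3 and 0-1-2-3 have weights 3, 2 and 1. With revealing edges (0,1) and (1,2), the first iteration collects both paths through (0,1) and the second adds nothing, so the path 0-1-2-3 is counted exactly once, in line with the last step of the proof.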
Appendix C Proof of Theorem 3.1
See 3.1
Proof.
We first denote (we recall that ). From line of Algorithm 1, we trivially have:
(6) 
Here, we recall , then from (2), we have:
(7) 
Under the condition that , we obtain:
(8) 
Here, the second equality comes from (6) and the inequality comes from the fact that for . From (8) and the inequality for any , we have the following inequality (we can easily check that for any ):
(9) 