1 Introduction
Resource allocation games have been studied extensively in the literature and have been shown to be very useful for modeling many practical situations, including online decision problems; see e.g. [7, 9, 20, 35]. In particular, two of the most renowned are the Colonel Blotto game (henceforth, CB game) and the Hide-and-Seek game (henceforth, HS game). In the (one-shot) CB game, two players, each with a fixed budget, simultaneously allocate their (indivisible) resources on battlefields; each player’s payoff is the aggregate of the values of the battlefields where she has the higher allocation. The scope of applications of the CB game includes a variety of problems; for instance, in security, where resources correspond to security forces (e.g., [12, 27]); in politics, where budgets are distributed to attract voters (e.g., [21, 25]); and in advertising, for distributing the ads’ broadcasting time (e.g., [23, 24]). On the other hand, in the (one-shot) HS game, a seeker chooses among locations () to search for a hider, who chooses the probability of hiding in each location. The seeker’s payoff is the sum of the probabilities that the hider hides in the chosen locations, and the hider’s payoff is the probability that she successfully escapes the seeker’s pursuit. Several variants of the HS game are used to model surveillance situations [6], anti-jamming problems [32], vehicle control [29], etc.
Both the CB and the HS games have a long-standing history (originated by [8] and [30], respectively); however, the results achieved so far in these games are mostly limited to their one-shot and full-information versions (see e.g., [5, 15, 25, 27, 31] for CB games and [17, 33] for HS games). In contrast, in most of the applications (e.g., telecommunications, web security, advertising), a more natural setting is one where the game is played repeatedly and players have access only to incomplete information at each stage. In this setting, players must learn the game sequentially, on the fly, and balance the trade-off between exploiting known information and exploring to gain new information. Thus, this work focuses on the following sequential learning problems:
The online CB game: fix ; at each stage, a learner who has the budget plays a CB game against some adversaries across battlefields; at the end of the stage, she receives limited feedback, namely the gain (loss) she obtains from each battlefield (but not the adversaries’ strategies). The battlefields’ values can change over time and are unknown to the learner when she makes her decision at each stage. This setting is generic and covers many applications of the CB game. For instance, in the radio resource allocation problem (in a cognitive radio network), a solution that balances efficiency and fairness is to provide the users with fictional budgets (the same budget at each stage) and let them bid across spectrum carriers simultaneously to compete for as many bandwidth portions as possible; the highest bidder on each carrier wins the corresponding bandwidth (see e.g., [13]). At the end of each stage, each user observes her own data rate (the gain/loss) achieved via each carrier (corresponding to the battlefields’ values) but does not know the other users’ bids. Note that the actual data rate can be noisy and can change over time. Moreover, users can enter and leave the system, so no stochastic assumption is made on the adversaries’ decisions.
The online HS game: fix (such that ); at each stage, the learner is a seeker who plays the same HS game (with and ) against an adversary; at the end of the stage, the seeker only observes the gains/losses she suffers from the locations she chose. This setting is practical; one motivating example is the spectrum sensing problem in the opportunistic spectrum access context (see e.g., [34]). At each stage, a secondary user (the learner) chooses to send the sensing signal to at most among channels (due to energy constraints, she cannot sense all channels) with the objective of sensing the channels with the highest availability. The learner can only measure the reliability (the gain/loss) of the channels that she sensed. Note that the channels’ availability depends on the primary users’ decisions, which are non-stochastic.
A formal definition of these problems is given in Section 4; hereinafter, we reuse the term CB game and HS game to refer to this sequential learning version of the games. The main challenge here is that the strategy space is exponential in the natural parameters (e.g., number of troops and battlefields in the CB game, number of locations in the HS game); hence how to efficiently learn in these games is an open question.
Our first contribution towards solving this open question is to show that the CB and HS games can be cast as a Path Planning Problem (henceforth, PPP), one of the most well-studied instances of the Online Combinatorial Optimization framework (henceforth, OComb; see [11] for a survey). In PPPs, given a graph with edges, at each stage, a learner chooses a path; then a loss is adversarially chosen for each edge, and the learner suffers the aggregate of the losses of the edges belonging to her chosen path. The learner’s goal is to minimize regret. The information that the learner receives in the CB and HS games as described above corresponds directly to the so-called semi-bandit feedback setting of PPPs, i.e., at the end of each stage, the learner observes the losses of the edges belonging to her chosen path (see Section 4 for more details). However, the specific structure of the considered games also allows the learner to deduce (without any extra cost) from the semi-bandit feedback the losses of some other edges that may not belong to the chosen path; these are called side-observations. Henceforth, we will use the term SOPPP to refer to this PPP under semi-bandit feedback with side-observations.
SOPPP is a special case of OComb with side-observations (henceforth, SOComb) studied by [19] and, following their approach, we will use observation graphs^{1} (defined in Section 2) to capture the learner’s observability. [19] focuses on the class of Follow-the-Perturbed-Leader (FPL) algorithms (originated from [18]) and proposes an algorithm named FPLIX for SOComb, which could be applied directly to SOPPP. However, this faces two main problems: (i) the efficiency of FPLIX is only guaranteed with high probability (as it depends on the geometric sampling technique) and it is still super-linear in terms of the time horizon, thus there is still room for improvement; (ii) FPLIX requires that there exists an efficient oracle that solves an optimization problem at each stage. Both of these issues are incompatible with our goal of learning in the CB and HS games: (i) although the probability that FPLIX fails to terminate is small, this could lead to issues in implementing it in practice, where the learner is obliged to give a decision quickly in each stage; (ii) it is unclear which oracle should be used in applying FPLIX to the CB and HS games.
^{1} The observation graphs, proposed by [19] and used here for SOPPP, extend the side-observations model for multi-armed bandit problems studied by [1, 2, 22]. Indeed, they capture side-observations between edges, whereas the side-observations model considered by [1, 2, 22] is between actions, i.e., paths in PPPs.
In this paper, we focus instead on another prominent class of OComb algorithms, called Exp3 [4, 14]. One of the key open questions in this field is how to design a variant of Exp3 with efficient running time and good regret guarantees for OComb problems in each feedback setting (see, e.g., [10]). Our second contribution is to propose an Exp3-type algorithm for SOPPPs that solves both of the aforementioned issues of FPLIX and provides good regret guarantees; i.e., we give an affirmative answer to an important subset of the above-mentioned open problem. In more detail, this contribution is threefold: (i) We propose a novel algorithm, Exp3OE, that is applicable to any instance of SOPPP. Importantly, Exp3OE is always guaranteed to run efficiently (i.e., in polynomial time in terms of the number of edges of the graph in SOPPP) without the need for any auxiliary oracle; (ii) We prove that Exp3OE guarantees an upper bound on the expected regret that matches in order the best benchmark in the literature (the FPLIX algorithm). We also prove further improvements under additional assumptions on the observation graphs that have so far been ignored in the literature; (iii) We demonstrate the benefit of using the Exp3OE algorithm in the CB and HS games.
Note importantly that the SOPPP model (and the Exp3OE algorithm) can be applied to many problems beyond the CB and HS games, e.g., auctions and recommendation systems. To highlight this and for the sake of conciseness, we first study the generic model of SOPPP in Section 2 and present our second contribution in Section 3, i.e., the Exp3OE algorithm in SOPPPs; we delay the formal definition of the CB and HS games, together with the analysis of running Exp3OE in these games (i.e., our first contribution), to Section 4.
Throughout the paper, we use bold symbols to denote vectors, e.g., , and to denote the th element. For any , the set is denoted by and the indicator function of a set is denoted by . For graphs, we write either or to indicate that an edge belongs to a path . Finally, we use as a version of the big-O asymptotic notation that ignores the logarithmic terms.
2 Path Planning Problems with Side-Observations (SOPPP) Formulation
As discussed in Section 1, motivated by the CB and HS games, we propose the path planning problem with semi-bandit and side-observations feedback (SOPPP).
SOPPP model. Consider a directed acyclic graph (henceforth, DAG), denoted by , whose set of vertices and set of edges are respectively denoted by and . Let and ; there are two special vertices, a source and a destination, that are respectively called and . We denote by the set of all paths starting from and ending at ; let us define . Each path corresponds to a vector in (thus, ) where if and only if edge belongs to . Let be the length of the longest path in , that is . Given a time horizon , at each (discrete) stage , a learner chooses a path . Then, a loss vector is secretly and adversarially chosen. Each element corresponds to the scalar loss embedded on the edge . Note that we consider the non-oblivious adversary, i.e., can be an arbitrary function of the learner’s past actions , but not .^{2} The learner’s incurred loss is , i.e., the sum of the losses from the edges belonging to . The learner’s feedback at stage after choosing is as follows. First, she receives a semi-bandit feedback, that is, she observes the edges’ losses , for any belonging to the chosen path . Additionally, each edge may reveal the losses on several other edges. To represent these side-observations at time , we consider a graph, denoted , containing vertices. Each vertex of corresponds to an edge of the graph . There exists a directed edge from a vertex to a vertex in if, by observing the edge loss , the learner can also deduce the edge loss ; we also denote this by and say that the edge reveals the edge . The objective of the learner is to minimize the cumulative expected regret, defined as .
^{2} This setting is considered by most of the works in the non-stochastic/adversarial bandits literature, e.g., [2, 10].
Hereinafter, in places where there is no ambiguity, we use the term path to refer to a path in and the term observation graphs to refer to . In general, these observation graphs can depend on the decisions of both the learner and the adversary. On the other hand, all vertices in always have self-loops. In the case where none among contains any edge other than these self-loops, no side-observation is allowed and the problem reduces to the classical semi-bandit setting. If all are complete graphs, SOPPP corresponds to the full-information PPPs. In this work, we focus on the uninformed setting, i.e., the learner observes only after making a decision at time . Moreover, let us introduce two new notations:
Intuitively, is the set of all paths that, if chosen, reveal the loss on the edge and is the set of all edges whose losses are revealed if the path is chosen. Trivially, . Moreover, due to the semi-bandit feedback, if , then and . Apart from the results for general observation graphs, in this work, we additionally present several results under two particular assumptions, satisfied by some instances in practice (e.g., the CB and HS games), that provide more refined regret bounds compared to the cases considered by [19]: (i) symmetric observation graphs, where for each edge from to , there also exists an edge from to (i.e., if then ), so that is an undirected graph; (ii) observation graphs under the following assumption, which requires that if two edges belong to a path in , then they cannot simultaneously reveal the loss of another edge.
Assumption : For any , if and , then .
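To make the feedback model above concrete, the following is a minimal sketch in Python, not the paper's notation. It assumes edges are arbitrary hashable labels, a path is a list of edges, and an observation graph is a dictionary mapping each edge to the set of other edges it reveals; the function names `path_loss` and `revealed_edges` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set

Edge = Hashable  # assumption: any hashable label identifies an edge


def path_loss(path: Iterable[Edge], loss: Dict[Edge, float]) -> float:
    """Loss incurred by the learner: sum of the losses of the edges on the path."""
    return sum(loss[e] for e in path)


def revealed_edges(path: Iterable[Edge],
                   observation_graph: Dict[Edge, Set[Edge]]) -> Set[Edge]:
    """Edges whose losses the learner observes after playing `path`.

    Every edge on the path is revealed (self-loops, i.e. semi-bandit feedback),
    together with every edge pointed to by an edge of the path (side-observations).
    """
    path = list(path)
    revealed = set(path)
    for e in path:
        revealed |= observation_graph.get(e, set())
    return revealed
```

With an empty observation graph, `revealed_edges` returns exactly the edges of the path, which is the classical semi-bandit case described above.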
3 Exp3OE: An Efficient Algorithm for the SOPPP
In this section, we present a new algorithm for SOPPP, called Exp3OE (OE stands for Observable Edges), whose pseudo-code is given by Algorithm 3. The guarantees on the expected regret of Exp3OE in SOPPP are analyzed in Section 3.2. Moreover, Exp3OE always runs efficiently, in polynomial time in terms of the number of edges of ; this is discussed in Section 3.1.
As an Exp3type algorithm, Exp3OE relies on the average weights sampling where at stage we update the weight on each edge by the exponential rule (line ). For each path , we denote the path weight and define the following terms:
(1) 
Line 5 of Exp3OE involves a sub-algorithm, called the WPS algorithm, that samples a path with probability (the sampled path is then denoted by ) from any input at each stage . This algorithm is based on a classical technique called weight pushing (see e.g., [28, 16]). We discuss further details and present an explicit formulation of the WPS algorithm in Appendix A.
Compared to other instances of the Exp3-type algorithms, Exp3OE has two major differences. First, at each stage , the loss of each edge is estimated by (line ) based on the term and a parameter . Intuitively, is the probability that the loss on the edge is revealed from playing the chosen path at . Moreover, the implicit exploration parameter added to the denominator allows us to “pretend to explore” in Exp3OE without knowing the observation graph before making the decision at stage (the uninformed setting). Unlike the standard Exp3, the loss estimator used in Exp3OE is biased, i.e., for any ,
(2)
Here, denotes the expectation w.r.t. the randomness of choosing a path at stage . Second, unlike standard Exp3 algorithms that keep track of and update the weight of each path, the weight pushing technique is applied at line (via the WPS algorithm) and line (via Algorithm 3.1 in Section 3.1), where we work with edge weights instead of path weights (recall that ).
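The estimation and update steps described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's pseudo-code: it assumes the usual implicit-exploration form, in which the observed loss of a revealed edge is divided by its reveal probability plus the exploration parameter, and unrevealed edges get estimate zero. The names `estimate_losses`, `update_weights`, `beta` and `eta` are hypothetical.

```python
import math
from typing import Dict, Hashable, Set

Edge = Hashable


def estimate_losses(observed_losses: Dict[Edge, float],
                    revealed: Set[Edge],
                    reveal_prob: Dict[Edge, float],
                    beta: float) -> Dict[Edge, float]:
    """Implicit-exploration estimate for every edge (assumed form).

    Revealed edges get loss / (reveal probability + beta); all others get 0.
    Adding beta > 0 to the denominator makes the estimate biased downwards.
    """
    return {
        e: (observed_losses[e] / (reveal_prob[e] + beta) if e in revealed else 0.0)
        for e in reveal_prob
    }


def update_weights(weights: Dict[Edge, float],
                   estimates: Dict[Edge, float],
                   eta: float) -> Dict[Edge, float]:
    """Exponential update applied per edge, with learning rate eta."""
    return {e: w * math.exp(-eta * estimates[e]) for e, w in weights.items()}
```

Under this assumed form, the expected estimate of an edge is its true loss scaled by the ratio of the reveal probability to the reveal probability plus `beta`, which is why the estimator is biased whenever `beta` is positive.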
3.1 Running Time Efficiency of the Exp3OE Algorithm
In the WPS algorithm mentioned above, we need to compute the terms and for any vertex in . Intuitively, is the aggregate weight of all paths from vertex to vertex at stage . These terms can be computed recursively in time using dynamic programming. This computation is often referred to as weight pushing. Following the literature, we present in Appendix A an explicit algorithm, called the WP algorithm, that outputs from any input . Then, a path in is sampled sequentially, edge by edge, based on these terms by the WPS algorithm. Importantly, the WP and WPS algorithms run efficiently in time.
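The weight pushing idea described above can be sketched in a few lines. This is a simplified illustration, not the WP and WPS algorithms of Appendix A: it assumes the DAG is given as a successor dictionary with no parallel edges, that an edge is identified by its pair of endpoints, and that the weight of a path is the product of its edge weights. The names `suffix_weights` and `sample_path` are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Tuple

Vertex = Hashable
Edge = Tuple[Vertex, Vertex]


def suffix_weights(succ: Dict[Vertex, List[Vertex]],
                   weight: Dict[Edge, float],
                   dest: Vertex) -> Dict[Vertex, float]:
    """H[u] = total weight of all paths from u to dest (dynamic programming)."""
    H = {dest: 1.0}

    def visit(u: Vertex) -> float:
        if u not in H:
            H[u] = sum(weight[(u, v)] * visit(v) for v in succ.get(u, ()))
        return H[u]

    for u in succ:
        visit(u)
    return H


def sample_path(succ: Dict[Vertex, List[Vertex]],
                weight: Dict[Edge, float],
                source: Vertex, dest: Vertex,
                rng=random) -> List[Edge]:
    """Sample a source-dest path with probability proportional to its weight.

    At vertex u, the edge (u, v) is drawn with probability
    weight[(u, v)] * H[v] / H[u], so the product along the path telescopes
    to (path weight) / H[source].
    """
    H = suffix_weights(succ, weight, dest)
    path, u = [], source
    while u != dest:
        nxt = list(succ[u])
        scores = [weight[(u, v)] * H[v] for v in nxt]
        v = rng.choices(nxt, weights=scores, k=1)[0]
        path.append((u, v))
        u = v
    return path
```

Each edge is visited a constant number of times in `suffix_weights`, so the cost of one sample is linear in the number of edges, without ever enumerating paths.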
The final non-trivial step to efficiently implement Exp3OE is to compute in line , i.e., the probability that an edge is revealed at stage . Note that is the sum of terms; therefore, a direct computation is inefficient, while a naive application of the weight pushing technique can easily lead to errors. To compute , we propose Algorithm 3.1, a non-straightforward application of weight pushing, in which we consecutively consider all the edges . Then, we take the sum of the terms of the paths going through by the weight pushing technique while making sure that each of these terms is included only once, even if has more than one edge revealing (this is a non-trivial step). In Algorithm 3.1, we denote by the set of the direct successors of any vertex . We give a proof that Algorithm 3.1 outputs exactly as defined in line of Algorithm 3 in Appendix B. Algorithm 3.1 runs in time; therefore, line of Algorithm 3 can be done in at most time.
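For intuition, the reveal probability discussed above can be computed directly from its definition on a small graph. The sketch below is deliberately the naive, exponential-time computation and is not Algorithm 3.1; it may serve as a reference against which an efficient implementation can be checked. It assumes that a path is played with probability proportional to the product of its edge weights, and the names `all_paths` and `reveal_probabilities` are hypothetical.

```python
from math import prod
from typing import Dict, Hashable, Iterator, List, Set, Tuple

Vertex = Hashable
Edge = Tuple[Vertex, Vertex]


def all_paths(succ: Dict[Vertex, List[Vertex]],
              source: Vertex, dest: Vertex) -> Iterator[List[Edge]]:
    """Enumerate every source-dest path of a DAG (exponential in general)."""
    if source == dest:
        yield []
        return
    for v in succ.get(source, ()):
        for rest in all_paths(succ, v, dest):
            yield [(source, v)] + rest


def reveal_probabilities(succ: Dict[Vertex, List[Vertex]],
                         weight: Dict[Edge, float],
                         obs: Dict[Edge, Set[Edge]],
                         source: Vertex, dest: Vertex) -> Dict[Edge, float]:
    """q[e] = probability that the played path reveals the loss of edge e."""
    paths = list(all_paths(succ, source, dest))
    path_w = [prod(weight[e] for e in p) for p in paths]
    total = sum(path_w)
    q = {e: 0.0 for e in weight}
    for p, w in zip(paths, path_w):
        revealed = set(p)                 # semi-bandit feedback
        for e in p:
            revealed |= obs.get(e, set())  # side-observations
        for e in revealed:                # a set: each path counts once per edge
            q[e] += w / total
    return q
```

Collecting the revealed edges in a set is what guarantees that a path is counted only once for a given edge, even when several of its edges reveal that edge; this is the property an efficient weight-pushing version has to preserve.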
In conclusion, Exp3OE runs in at most time; this guarantee holds even in the worst-case scenario. For comparison, the FPLIX algorithm runs in time in expectation and in time with a probability at least for an arbitrary .^{3} That is, FPLIX might fail to terminate with a strictly positive probability^{4} and it is not guaranteed to have an efficient running time in all cases. Moreover, although this complexity bound of FPLIX is slightly better in terms of , the complexity bound of Exp3OE improves on it by a factor of . As is often the case in no-regret analysis, we consider the setting where T is significantly larger than the other parameters of the problem; this is also consistent with the motivating applications of the CB and HS games presented in Section 1. Therefore, our contribution in improving the algorithm’s running time in terms of is relevant.
^{3} If one runs FPLIX with Dijkstra’s algorithm as the optimization oracle and with the parameters chosen by [19].
^{4} A stopping criterion for FPLIX can be chosen to avoid this issue, but it raises the question of how to choose the criterion such that the regret guarantees hold.
3.2 Performance of the Exp3OE Algorithm
In this section, we present an upper bound on the expected regret achieved by the Exp3OE algorithm in the SOPPP. For the sake of brevity, with defined in (1), for any and , we denote:
Intuitively, is the probability that the chosen path at stage contains an edge and is the summation over all the edges of the ratio of this quantity and the probability that the loss of an edge is revealed (plus ). We can bound the expected regret with this key term .
Theorem 3.1.
The expected regret of the Exp3OE algorithm in the SOPPP satisfies:
(3) 
A complete proof of Theorem 3.1 can be found in Appendix C; it follows an approach similar to [2, 10], with several necessary adjustments to handle the new biased loss estimator in Exp3OE. To see the relationship between the structure of the side-observations of the learner and the bound on the expected regret, we look for upper bounds on in terms of the observation graphs’ parameters. Let be the independence number^{5} of ; we have the following statement.
^{5} The independence number of a directed graph is computed while ignoring the direction of the edges.
Theorem 3.2.
Let us define , and . Upper bounds on in different cases of are given in the following table:
              | satisfies | does not satisfy
Symmetric     |           |
Non-symmetric |           |
A proof of this theorem is given in Appendix E. The main idea of this proof is based on several graph-theoretical lemmas that are extracted from [2, 19, 22]. These lemmas establish the relationship between the independence number of a graph and the ratios of the weights on the graph’s vertices that have forms similar to the key term . The case where the observation graphs are non-symmetric and do not satisfy assumption is the most general setting. Moreover, as shown in Theorem 3.2, the bounds on are improved if the observation graphs satisfy either the symmetry condition or assumption . Intuitively, given the same independence numbers, a symmetric observation graph gives the learner more information than a non-symmetric one; thus, it yields a better bound on and the expected regret. On the other hand, assumption is a technical assumption that allows the use of different techniques in the proofs to obtain better bounds. These cases have not been explicitly analyzed in the literature, although they are satisfied by several practical situations, including the CB and HS games (see Section 4).
Finally, we give results on the upper bounds of the expected regret obtained by the Exp3OE algorithm, presented as a corollary of Theorems 3.1 and 3.2.
Corollary 3.3.
In SOPPP, let be an upper bound of . With appropriate choices of the parameters and , the expected regret of the Exp3OE algorithm is:

in the general cases.

if assumption is satisfied by the observation graphs .
A proof of Corollary 3.3 and the choices of the parameters and (these choices are non-trivial) yielding these results are given in Appendix F. We can extract from this proof several more explicit results as follows: in the general case, when the observation graphs are non-symmetric and if they are all symmetric; on the other hand, in cases where all the observation graphs satisfy , if the observation graphs are non-symmetric and if they are all symmetric.
We note that a trivial upper bound of is the number of vertices of the graph , which is (the number of edges in ). In general, the more connected is, the smaller may be chosen, and thus the better the upper bound on the expected regret. In the (classical) semi-bandit setting, and in the full-information setting, , . Finally, we also note that, if (this is typical in practice, including the CB and HS games), the bound in Corollary 3.3 matches in order the bounds (ignoring the logarithmic factors) given by the FPLIX algorithm (see [19]). On the other hand, the form of the regret bound provided by the Exp3IX algorithm (see [19]) does not allow us to compare it directly with the bound of Exp3OE in the general SOPPP. Exp3IX is only analyzed by [19] when , i.e., ; in this case, we observe that the bound given by our Exp3OE algorithm is better than that of Exp3IX (by some multiplicative constants).
4 Colonel Blotto Games and Hide-and-Seek Games as SOPPP
Given the regret analysis of Exp3OE in SOPPP, we now return to our main motivation, the Colonel Blotto and the Hide-and-Seek games, and discuss how to apply our findings to these games. To this end, we formally define the online version of the games and show how these problems can be formulated as SOPPP in Sections 4.1 and 4.2; then we demonstrate the benefit of using the Exp3OE algorithm for learning in these games (Section 4.3).
4.1 Colonel Blotto Games as an SOPPP
The online Colonel Blotto game (the CB game). This is a game between a learner and an adversary over battlefields within a time horizon . Each battlefield has a value (unknown to the learner)^{6} at stage such that . At stage , the learner needs to distribute troops ( is fixed) towards the battlefields while the adversary simultaneously allocates hers; that is, the learner chooses a vector in the strategy set . At stage and battlefield , if the adversary’s allocation is strictly larger than the learner’s allocation , the learner loses this battlefield and she suffers the loss ; if their allocations are tied, she suffers the loss ; otherwise, she wins and suffers no loss. At the end of stage , the learner observes the loss from each battlefield (and which battlefields she wins, ties, or loses) but not the adversary’s allocations. The learner’s loss at each time is the sum of the losses from all the battlefields. The objective of the learner is to minimize her expected regret. Note that, similar to SOPPP, we also consider non-oblivious adversaries in the CB game.
^{6} Knowledge of the battlefields’ values is not assumed, lest it limit the scope of application of our model (see e.g., the radio resource allocation problem discussed in Section 1).
While this problem can be formulated as a standard OComb, it is difficult to derive an efficient learning algorithm under that formulation, due to the exponentially large set of strategies that the learner can choose from at each stage. Instead, we show that by reformulating the problem as an SOPPP, we are able to exploit the advantages of the Exp3OE algorithm to solve it. To do so, first note that the learner can deduce several side-observations as follows: (i) if she allocates troops to battlefield and wins, she knows that if she had allocated more than troops to , she would also have won; (ii) if she knows that the allocations are tied at battlefield , she knows exactly the adversary’s allocation to this battlefield and can deduce all the losses she might have suffered if she had allocated differently to battlefield ; (iii) if she allocates troops to battlefield and loses, she knows that if she had allocated fewer than to battlefield , she would also have lost.
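The three deduction rules above can be written down directly. The following sketch is illustrative only: it works with outcomes ("win", "tie", "lose") rather than loss values, since the loss of each outcome depends on the battlefield's value, and the function name `deduce_outcomes` and its parameters are hypothetical.

```python
from typing import Dict


def deduce_outcomes(my_troops: int, outcome: str, budget: int) -> Dict[int, str]:
    """Outcomes the learner can deduce on one battlefield.

    Returns a mapping {alternative allocation: outcome} containing the played
    allocation and every other allocation whose outcome follows from it.
    """
    known = {my_troops: outcome}
    if outcome == "win":
        # Any larger allocation would also have won.
        for z in range(my_troops + 1, budget + 1):
            known[z] = "win"
    elif outcome == "lose":
        # Any smaller allocation would also have lost.
        for z in range(my_troops):
            known[z] = "lose"
    elif outcome == "tie":
        # The adversary's allocation equals my_troops, so everything is known.
        for z in range(budget + 1):
            if z != my_troops:
                known[z] = "win" if z > my_troops else "lose"
    else:
        raise ValueError("outcome must be 'win', 'tie' or 'lose'")
    return known
```

Note the asymmetry: winning with 2 troops reveals the outcome of allocating 3, but winning with 3 troops says nothing about allocating 2.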
Now, to cast the CB game as SOPPP, for each instance of the parameters and , we create a DAG such that the strategy set has a one-to-one correspondence to the paths set of . Due to the lack of space, we only present here an example illustrating the graph of an instance of the CB game in Figure 1(a) and we give the formal definition of in Appendix G. The graph has edges and paths, while the length of every path is . Each edge in corresponds to allocating a certain amount of troops to a battlefield. Therefore, the CB game model is equivalent to a PPP where at each stage the learner chooses a path in and the loss on each edge is generated from the allocations of the adversary and the learner (corresponding to that edge) according to the rules of the game. At stage , the (semi-bandit) feedback and the side-observations^{7} deduced by the learner as described above induce an observation graph . This formulation transforms any CB game into an SOPPP.
^{7} E.g., in Figure 1(a), if the learner chooses a path going through edge (corresponding to allocating troop to battlefield ) and wins (thus, the loss at edge is ), then she deduces that the losses on the edges , and (corresponding to allocating at least troop to battlefield ) are all .
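One natural way to build such a graph is a layered DAG whose vertices record how many troops have been used after each battlefield. The sketch below is an assumption made for illustration and may differ in detail from the formal definition in Appendix G; it also assumes that the learner allocates her whole budget. The names `blotto_dag` and `count_paths` are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Vertex = Tuple[int, int]  # (battlefields processed, troops used so far)


def blotto_dag(budget: int, battlefields: int) -> Dict[Vertex, List[Vertex]]:
    """Successor lists of a layered DAG for the CB game (assumed construction).

    The edge (i, j) -> (i + 1, j2) means: allocate j2 - j troops to
    battlefield i + 1. Source is (0, 0); destination is (battlefields, budget).
    """
    succ: Dict[Vertex, List[Vertex]] = {}
    for i in range(battlefields):
        used_so_far = [0] if i == 0 else range(budget + 1)
        last = (i == battlefields - 1)
        for j in used_so_far:
            targets = [budget] if last else range(j, budget + 1)
            succ[(i, j)] = [(i + 1, j2) for j2 in targets]
    return succ


def count_paths(succ: Dict[Hashable, List[Hashable]],
                source: Hashable, dest: Hashable) -> int:
    """Number of source-dest paths, by dynamic programming over the DAG."""
    memo = {dest: 1}

    def visit(u: Hashable) -> int:
        if u not in memo:
            memo[u] = sum(visit(v) for v in succ.get(u, ()))
        return memo[u]

    return visit(source)
```

Under these assumptions, every path corresponds to exactly one allocation of the budget, so the path count equals the stars-and-bars count of allocations, while the number of edges grows only polynomially in the budget and the number of battlefields.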
Note that since there are edges in that refer to the same allocation (e.g., the edges , and in all refer to allocating troops to battlefield ), in the observation graphs, the vertices corresponding to these edges are always connected. Therefore, an upper bound of the independence number of in the CB game is . Moreover, we can verify that the observation graph of the CB game satisfies assumption for any and that it is non-symmetric.
4.2 Hide-and-Seek Games as an SOPPP
The online Hide-and-Seek game (the HS game). This is a repeated game (within the time horizon ) between a hider and a seeker. In this work, we consider that the learner plays the role of the seeker and the hider is the adversary. There are locations, indexed from to . At stage , the learner sequentially chooses locations (), called an search, to search for the hider, that is, she chooses (if , we say that location is her th move). The hider maliciously assigns losses to all locations (intuitively, these losses can be the time wasted on inspecting a mismatched location or the probability that the hider does not hide there, etc.). In the HS game, the adversary is non-oblivious; moreover, in this work, we consider the following condition on how the hider/adversary assigns the losses to the locations:

At stage , the adversary secretly assigns a loss to each location (unknown to the learner). These losses are fixed throughout the search of the learner.
The learner’s loss at stage is the sum of the losses from her chosen locations in the search at stage , that is . Moreover, often in practice the search of the learner needs to satisfy some constraints. In this work, as an example, we use the following constraint: for a fixed (called the coherence constraint), i.e., the seeker cannot search too far away from her previously chosen location.^{8} At the end of stage , the learner only observes the losses from the locations she chose in her search, and her objective is to minimize her total loss over .
^{8} Our results can be applied to HS games with other constraints, such as , i.e., she can only search forward; or, , i.e., she cannot search a location more than times, etc.
Similar to the case of the CB game, tackling the HS game as a standard OComb is computationally involved. As such, we follow the SOPPP formulation instead. To do this, we create a DAG whose paths set has a one-to-one correspondence to the set containing all feasible searches of the learner in the HS game with locations under the coherence constraint. Figure 1(b) illustrates the corresponding graph of an instance of the HS game and we give a formal definition of in Appendix G. The HS game is equivalent to the PPP where the learner chooses a path in and the edges’ losses are generated by the adversary at each stage (note that to ensure all paths end at , there are auxiliary edges in that are always embedded with losses). Note that there are edges and paths in . Moreover, knowing that the adversary follows condition , the learner can deduce the following side-observations: within a stage, the loss at each location remains the same no matter when it is chosen during the search, i.e., knowing the loss of choosing location as her th move, the learner knows the loss she would suffer if she chose location as her th move for any . The semi-bandit feedback and side-observations as described above generate the observation graphs (e.g., in Figure 1(b), the edges , and represent that location is chosen; thus, they mutually reveal each other). The independence number of is for any . The observation graphs of the HS game are symmetric and do not satisfy . Finally, we consider a relaxation of condition :

At stage , the adversary assigns a loss on each location . For , after the learner chooses, say location , as her th move, the adversary can observe that and change the losses for any location that has not been searched before by the learner,^{9} i.e., she can change the losses .
^{9} An interpretation is that by searching a location, the learner/seeker “discovers and secures” that location; therefore, the adversary/hider cannot change her assigned loss at that place.
By replacing condition with condition , we limit the side-observations of the learner: she can only deduce that if , the edges in representing choosing a location as the move reveal the edges representing choosing that same location as the th move, but not vice versa. In this case, the observation graph is non-symmetric; however, its independence number is still as in the HS games with condition .
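To give a feel for the size of the learner's strategy set in the HS game, the sketch below counts the feasible searches by dynamic programming instead of enumerating them. It is an illustration under an explicit assumption: the coherence constraint is read here as "two consecutive moves are at most `max_jump` locations apart", which matches the verbal description above but is not taken from the formal definition. The name `count_coherent_searches` and its parameters are hypothetical.

```python
def count_coherent_searches(locations: int, moves: int, max_jump: int) -> int:
    """Number of move sequences of length `moves` over `locations` locations
    in which consecutive moves differ by at most `max_jump` (assumed constraint).
    """
    # ways[j] = number of feasible searches of the current length ending at j
    ways = [1] * locations
    for _ in range(moves - 1):
        ways = [
            sum(ways[i] for i in range(locations) if abs(i - j) <= max_jump)
            for j in range(locations)
        ]
    return sum(ways)
```

When `max_jump` is at least `locations - 1` the constraint is void and the count is `locations ** moves`, which shows the exponential growth in the number of moves that makes a path-based formulation attractive.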
4.3 Performance of Exp3OE in the Colonel Blotto and HideandSeek Games
Having formulated the CB game and the HS game as SOPPPs, we can use the Exp3OE algorithm in these games. From Section 3.1 and the specific graphs of the CB and HS games, we can deduce that Exp3OE runs in at most time. We remark again that Exp3OE’s running time is linear in and efficient in all cases, unlike that of FPLIX when run in the CB and HS games. Moreover, we can deduce the following result directly from Corollary 3.3:
Corollary 4.1.
The expected regret of the Exp3OE algorithm satisfies:

in the CB games with troops and battlefields.

in the HS games with locations and search.
At a high level, given the same scale on their inputs, the independence numbers of the observation graphs in HS games are smaller than in CB games (by a multiplicative factor of ). However, since assumption is satisfied by the observation graphs of the CB games and not by those of the HS games, the expected regret bounds of the Exp3OE algorithm in these games have the same order of magnitude. From Corollary 4.1, we note that in the CB games, the order of the regret bounds given by Exp3OE is better than that of the FPLIX algorithm (thanks to the fact that is satisfied).^{10} On the other hand, in the HS games with , the regret bounds of the Exp3OE algorithm improve on the bound of FPLIX, but they are still of the same order in the games’ parameters (ignoring the logarithmic factors).^{11} Note that the regret bound of Exp3OE in the HS game with Condition (involving symmetric observation graphs) is slightly better than that in the HS game with Condition .
^{10} More explicitly, in the CB game, FPLIX has a regret of at most (C is a constant indicated by [19]) and Exp3OE’s regret bound is (if , we can rewrite this bound as ).
^{11} More explicitly, in HS games with , FPLIX’s regret is and Exp3OE’s regret is (similar results can be obtained for the HS games with ).
We also conducted several numerical experiments that compares the running time and the actual expected regret of Exp3OE and FPLIX in CB and HS games. The numerical results are in consistent with theoretical results in this work. Our code for these experiments can be found at https://github.com/dongquan11/CBHS.SOPPP.
Finally, we compare the regret guarantees given by our Exp3OE algorithm and by the OSMD algorithm (see [3])—the benchmark algorithm for OComb with semibandit feedback (although OSMD does not run efficiently in general): Exp3OE is better than OSMD in CB games if ; in HS games if and in the HS games with condition if . We give a proof of this statement in Appendix H. Intuitively, the regret guarantees of Exp3OE is better than that of OSMD in the CB games where the learner’s budget is sufficiently larger than the number of battlefields and in the HS games where the total number of locations is sufficiently larger than the number of moves that the learner can make in each stage.
5 Conclusion
In this work, we introduce the Exp3OE algorithm for the path planning problem with semi-bandit feedback and side-observations. Exp3OE is always efficiently implementable. Moreover, its regret guarantees match those of the FPLIX algorithm (and are better in some cases). We apply our findings to derive the first solutions to the online version of the Colonel Blotto and Hide-and-Seek games. This work also extends the scope of application of the PPP model in practice, even for large instances.
Acknowledgment:
This work was supported by ANR through the “Investissements d’avenir” program (ANR15IDEX02) and grant ANR16TERC0012; and by the Alexander von Humboldt Foundation.
References
 [1] (2015) Online learning with feedback graphs: beyond bandits. In JMLR Workshop and Conference Proceedings, Vol. 40. Cited by: footnote 1.
 [2] (2013) From bandits to experts: a tale of domination and independence. In Advances in Neural Information Processing Systems, pp. 1610–1618. Cited by: Appendix D, Appendix E, §3.2, §3.2, footnote 1, footnote 2.
 [3] (2014) Regret in online combinatorial optimization. Mathematics of Operations Research 39 (1), pp. 31–45. Cited by: §4.3.
 [4] (2002) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §1.

 [5] (2017) Faster and simpler algorithm for optimal strategies of blotto game. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI17), pp. 369–375. Cited by: §G.1, §1.
 [6] (2014) Surveillance for security as a pursuit-evasion game. In International Conference on Decision and Game Theory for Security, pp. 370–379. Cited by: §1.
 [7] (2013) Audit games. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI13), pp. 41–47. Cited by: §1.
 [8] (1921) La théorie du jeu et les équations intégrales à noyau symétrique. Comptes rendus de l’Académie des Sciences 173 (13041308), pp. 58. Cited by: §1.
 [9] (2005) From resource allocation to strategy. Oxford University Press. Cited by: §1.
 [10] (2012) Combinatorial bandits. Journal of Computer and System Sciences 78 (5), pp. 1404–1422. Cited by: §1, §3.2, footnote 2.

 [11] (2013) Combinatorial multi-armed bandit: general framework and applications. In Proceedings of the 30th International Conference on Machine Learning (ICML1, pp. 151–159. Cited by: §1.
 [12] (2012) Colonel Blotto in web security. In The 11th Workshop on Economics and Information Security, WEIS Rump Session, pp. 141–150. Cited by: §1.
 [13] (2019) Stochastic asymmetric blotto game approach for wireless resource allocation strategies. IEEE Transactions on Wireless Communications. Cited by: §1.
 [14] (1997) A decisiontheoretic generalization of online learning and an application to boosting. Journal of computer and system sciences 55 (1), pp. 119–139. Cited by: §1.
 [15] (1950) A continuous colonel blotto game. Technical report RAND project air force Santa Monica CA. Cited by: §1.
 [16] (2007) The online shortest path problem under partial monitoring. Journal of Machine Learning Research 8 (Oct), pp. 2369–2403. Cited by: Appendix A, §3.
 [17] (2000) Probabilistic pursuitevasion games: a onestep nash approach. In Proceedings of the 39th IEEE Conference on Decision and Control, pp. 2272–2277. Cited by: §1.
 [18] (2005) Efficient algorithms for online decision problems. Journal of Computer and System Sciences 71 (3), pp. 291–307. Cited by: §1.
 [19] (2014) Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pp. 613–621. Cited by: Appendix D, §1, §2, §3.2, §3.2, footnote 1, footnote 10, footnote 3.
 [20] (2010) Complexity of computing optimal stackelberg strategies in security resource allocation games. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI10), pp. 805–810. Cited by: §1.
 [21] (2012) Coalitional Colonel Blotto games with application to the economics of alliances. Journal of Public Economic Theory 14 (4), pp. 653–676. External Links: Document Cited by: §1.
 [22] (2011) From bandits to experts: on the value of sideobservations. In Advances in Neural Information Processing Systems, pp. 684–692. Cited by: Appendix D, §3.2, footnote 1.
 [23] (Sep. 2014) Strategic resource allocation for competitive influence in social networks. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 951–958. Cited by: §1.
 [24] (Dec. 2015) Defensive resource allocation in social networks. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC), pp. 2927–2932. Cited by: §1.
 [25] (2006) The Colonel Blotto game. Economic Theory 29 (1), pp. 2–24. External Links: Document Cited by: §1, §1.
 [26] (2018) Efficient bandit combinatorial optimization algorithm with zerosuppressed binary decision diagrams. In International Conference on Artificial Intelligence and Statistics, pp. 585–594. Cited by: Appendix A.
 [27] (2014) The heterogeneous Colonel Blotto game. In Proceedings of the 7th International Conference on Network Games, Control and Optimization (NetGCoop), pp. 232–238. Cited by: §1, §1.
 [28] (2003) Path kernels and multiplicative updates. Journal of Machine Learning Research 4 (Oct), pp. 773–818. Cited by: Appendix A, §3.
 [29] (2002) Probabilistic pursuitevasion games: theory, implementation, and experimental evaluation. IEEE transactions on robotics and automation 18 (5), pp. 662–669. Cited by: §1.
 [30] (1953) A certain zerosum twoperson game equivalent to the optimal assignment problem. Contributions to the Theory of Games 2, pp. 5–12. Cited by: §1.
 [31] (Jul. 2018) Efficient computation of approximate equilibria in discrete colonel blotto games. In Proceedings of the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAIECAI), pp. 519–526. Cited by: §1.
 [32] (2016) Learning in hideandseek. IEEE/ACM Transactions on Networking 24 (2), pp. 1279–1292. Cited by: §1.
 [33] (1987) Pursuit–evasion differential games with deception or interrupted observation. In PursuitEvasion Differential Games, pp. 191–203. Cited by: §1.
 [34] (2009) A survey of spectrum sensing algorithms for cognitive radio applications. IEEE communications surveys & tutorials 11 (1), pp. 116–130. Cited by: §1.
 [35] (2009) A multiagent learning approach to online distributed resource allocation. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, Cited by: §1.
Appendix
Appendix A Weight Pushing for Path Sampling
We revisit some useful results in the literature. In this section, we consider a DAG with parameters as introduced in Section 2. For simplicity, we assume that each edge in belongs to at least one path in . Let us respectively denote by and the set of the direct successors and the set of the direct predecessors of any vertex . Moreover, let and respectively denote the edge and the set of all paths from vertex to vertex .
Let us consider a weight for each edge . The Exp3OE algorithm needs to sample a path with probability:
(4) 
Directly computing and sampling from takes time, which is very inefficient. To sample the path efficiently, we first label the vertex set by such that if there exists an edge connecting to then . We then define the following terms for each vertex :
Intuitively, is the aggregate weight of all paths from vertex to vertex and is exactly the denominator in (4). These terms and can be recursively computed by the WP algorithm (i.e., Algorithm 1) that runs in time, through dynamic programming. This is called weight pushing and it is used by [16, 26, 28].
Based on the WP algorithm (i.e., Algorithm 1), we construct the WPS algorithm (i.e., Algorithm 2) that uses the weights as inputs and randomly outputs a path in . Intuitively, starting from the source vertex , Algorithm 2 samples the path vertex by vertex based on the terms computed by Algorithm 1. It is noteworthy that Algorithm 2 also runs in time, and it is straightforward to verify that the probability that a path is sampled by Algorithm 2 matches exactly .
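The weight-pushing idea described above can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 1 or Algorithm 2 verbatim: the function names (`weight_pushing`, `sample_path`, `path_probability`), the 0-based vertex labels, and the dictionary representation of edge weights are all assumptions made for illustration. The sketch assumes that vertices are labeled in topological order, that vertex 0 is the source and vertex n-1 the destination, and that at least one source-to-destination path has positive weight.

```python
import random
from collections import defaultdict


def weight_pushing(n, weights):
    """Backward dynamic program over a DAG (hypothetical helper).

    `weights` maps an edge (u, v) with u < v to a non-negative weight.
    Returns (H, succ) where H[u] is the total weight of all paths from
    vertex u to the destination n-1, the weight of a path being the
    product of its edge weights.
    """
    succ = defaultdict(list)
    for (u, v), w in weights.items():
        succ[u].append((v, w))
    H = [0.0] * n
    H[n - 1] = 1.0  # the empty path from the destination to itself
    for u in range(n - 2, -1, -1):  # reverse topological order
        H[u] = sum(w * H[v] for v, w in succ[u])
    return H, succ


def sample_path(n, weights, rng=random):
    """Sample a source-to-destination path vertex by vertex.

    From vertex u, move to successor v with probability w(u, v) * H[v] / H[u].
    These factors telescope, so a path is drawn with probability equal to
    its weight divided by H[0].
    """
    H, succ = weight_pushing(n, weights)
    path, u = [0], 0
    while u != n - 1:
        candidates = [v for v, _ in succ[u]]
        probs = [w * H[v] / H[u] for v, w in succ[u]]
        u = rng.choices(candidates, weights=probs, k=1)[0]
        path.append(u)
    return path


def path_probability(path, n, weights):
    """Probability that `sample_path` returns `path` (for checking only)."""
    H, _ = weight_pushing(n, weights)
    prob = 1.0
    for u, v in zip(path, path[1:]):
        prob *= weights[(u, v)]
    return prob / H[0]


# Small diamond-shaped DAG used purely as an illustration.
example_weights = {(0, 1): 1.0, (0, 2): 3.0, (1, 3): 2.0, (2, 3): 1.0}
example_path = sample_path(4, example_weights)
```

Both routines visit each edge a constant number of times, which is in line with the linear-time behavior claimed for weight pushing; the explicit enumeration of paths is never needed.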
Appendix B Proof of Algorithm 3.1’s Output
Proof.
Fixing an edge , we prove that when Algorithm 3.1 takes the edge weights as the input, it outputs exactly . We note that if , then .
We denote and label the edges in the set by . The for-loop in lines  of Algorithm 3.1 runs consecutively over the edges in as follows:

After the for-loop runs for , we have ; therefore, since is computed from the original weights . Due to line that sets , henceforth in Algorithm 3.1, the weight of any path that contains is set to .

When the for-loop runs for , we have because any path has the weight . Therefore, .

Similarly, after the for-loop runs for (where ), we have:

Therefore, after the for-loop finishes running for every edge in , we have where each term was only counted once even if contains more than one edge that reveals the edge .
∎
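The counting argument in this proof can be checked on a toy instance with the following Python sketch. It is only an illustration under stated assumptions, not the paper's Algorithm 3.1: the names `path_sums` and `revealed_weight`, the 0-based topological vertex labels, and the choice to recompute the path sums from scratch in every iteration are hypothetical. The idea shown is the one used above: add the total weight of the remaining paths through a revealing edge, then set that edge's weight to zero so that no path is counted twice.

```python
def path_sums(n, weights):
    """Forward and backward path sums on a DAG (hypothetical helper).

    Assumes every edge (u, v) satisfies u < v, with source 0 and
    destination n-1. F[u] is the total weight of paths from 0 to u and
    B[u] is the total weight of paths from u to n-1.
    """
    F = [0.0] * n
    B = [0.0] * n
    F[0] = 1.0
    B[n - 1] = 1.0
    for (u, v) in sorted(weights):                # increasing tail vertex
        F[v] += F[u] * weights[(u, v)]
    for (u, v) in sorted(weights, reverse=True):  # decreasing tail vertex
        B[u] += weights[(u, v)] * B[v]
    return F, B


def revealed_weight(n, weights, revealing_edges):
    """Total weight of the paths containing at least one revealing edge.

    Each such path is counted exactly once, even if it contains several
    revealing edges, because an edge is zeroed out after it is processed.
    """
    w = dict(weights)  # work on a copy; the caller's weights are untouched
    total = 0.0
    for (u, v) in revealing_edges:
        F, B = path_sums(n, w)
        total += F[u] * w[(u, v)] * B[v]  # remaining paths through (u, v)
        w[(u, v)] = 0.0                   # these paths must not be recounted
    return total


# Toy DAG: the path 0-1-2-3 contains both revealing edges (0, 1) and (1, 2).
toy_weights = {(0, 1): 1.0, (0, 2): 3.0, (1, 2): 1.0, (1, 3): 2.0, (2, 3): 1.0}
toy_total = revealed_weight(4, toy_weights, [(0, 1), (1, 2)])
```

In the toy DAG the paths 0-1-3 and 0-1-2-3 have weights 2 and 1, so the result is 3 regardless of the order in which the two revealing edges are processed.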
Appendix C Proof of Theorem 3.1
See 3.1
Proof.
We first denote . (Footnote 12: We recall that .) From line of Algorithm 3, we trivially have:
(5) 
We recall that and that the notation denotes the expectation w.r.t. the randomness in choosing in Algorithm 3 (i.e., w.r.t. the information up to time ). From (2), we have:
(6) 
Under the condition that , we obtain: