1 Introduction
Bellman's optimality principle (Bellman's OP) [3] led to state-of-the-art solvers in many nontrivial sequential decision-making problems, assuming partial observability [25], multi-objective criteria [29, 21], collaborating agents, e.g., modeled as decentralized partially observable Markov decision processes (Dec-POMDPs) [13, 30, 9], or some non-collaborative perfect-information games (from Shapley's seminal work [26] to [6]). In all these settings this principle exploits the fact that subproblems are nested recursively within the original problem. An open question is whether, and how, it could be applied to imperfect-information games, which are encountered in diverse applications such as Poker [18] or security games [1]. This paper answers this question in the setting of 2-player zero-sum partially observable stochastic games (zs-POSGs), i.e., imperfect-information games with simultaneous moves, perfect recall, discounted rewards and a possibly infinite time horizon. As general POSGs and Dec-POMDPs, infinite-horizon zs-POSGs are undecidable, and their finite-horizon approximations are in NEXP [22, 4]. As further discussed in Section 2, solution techniques for finite-horizon POSGs, or other imperfect-information games that can be formulated as extensive-form games (EFGs), typically solve an equivalent normal-form game [27] or use a dedicated regret-minimization mechanism [34, 5]. They thus do not rely on Bellman's optimality principle, except (i) a dynamic programming approach that only constructs sets of non-dominated solutions [13], (ii) in collaborative problems (decentralized POMDPs), adopting the viewpoint of a (blind) central planner [30, 9], and (iii) for (mostly 2-player zero-sum) settings with observability assumptions such that one can reason on player beliefs [12, 7, 2, 15, 8, 14]. Here, we do not make any assumption beyond the game being 2-player zero-sum, in particular regarding observability of the state and actions.
As for a number of Dec-POMDP solvers, our approach adopts the viewpoint not of a player, but of a central (offline) planner that prescribes individual strategies to the players [30], which allows turning a zs-POSG into a non-observable game for which Bellman's optimality principle applies. This is achieved in Section 4 (after background Section 3) while reasoning not on a player's belief over the game state (as feasible in POMDPs or some particular games), but on the central planner's (blind) belief, a statistic called the occupancy state, which we prove to be sufficient for optimal planning, as Dibangoye et al. did for Dec-POMDPs [9]. In Section 5, our Bellman/Shapley operator is proved to induce an optimal game value function that is Lipschitz-continuous in occupancy space, which leads to deriving value function approximators, including upper- and lower-bounding ones, and discussing their initialization. Finally, Section 6 describes a variant of HSVI for zs-POSGs, and demonstrates its finite-time convergence to an optimal solution despite the continuous (occupancy) state and action spaces.
2 Related Work
Infinite-horizon POSGs are undecidable [22], which justifies searching for near-optimal solutions, e.g., through finite-horizon solutions, as we will do. There is little work on solving POSGs, in particular through exploiting Bellman's optimality principle. One exception is Hansen and Zilberstein's work on finite-horizon POSGs [13], where dynamic programming (DP) incrementally constructs non-dominated policy trees for each player, which then allows deriving a solver for common-payoff POSGs, i.e., decentralized partially observable Markov decision processes (Dec-POMDPs). Here, Bellman's OP thus serves as a preprocessing phase, while we aim at employing it at the core of algorithms.
DecPOMDPs
Bellman's OP appears as the core component of a Dec-POMDP solver when Szer et al. [30] adopt a planner-centric viewpoint whereby the planner aims at providing the players with their private policies without knowing which action-observation histories they have experienced. The planner's information state at time $t$ thus contains the initial belief and the joint policy up to $t$. This leads to turning a Dec-POMDP into an information-state MDP, and to obtaining a deterministic shortest-path problem that can be solved using an A* search called MAA* (multi-agent A*).
Then, another important step is when Dibangoye et al. [9] show that (i) the occupancy state, a statistic used to compute expected rewards in MAA*, is in fact sufficient for planning, and (ii) the optimal value function $V^*$ is piecewise linear and convex (PWLC) in occupancy space, which allows adapting point-based POMDP solvers using approximators of $V^*$.
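To make this PWLC representation concrete, here is a minimal sketch (with made-up numbers, not tied to the solvers' actual implementations) of a lower bound maintained as the maximum over a set of linear functions ("alpha-vectors") over occupancy states, as point-based solvers do:

```python
import numpy as np

# Toy sketch of a PWLC lower bound on V*: a set of linear functions
# ("alpha-vectors") over a 3-point occupancy simplex (hypothetical data).
alphas = [np.array([0.0, 1.0, 0.5]),
          np.array([0.8, 0.2, 0.0])]

def pwlc_value(occupancy):
    """Lower bound at an occupancy state = best linear function there."""
    return max(float(a @ occupancy) for a in alphas)

sigma = np.array([0.2, 0.5, 0.3])  # a distribution over (state, history) pairs
value = pwlc_value(sigma)
```

A point-based update would simply append a new alpha-vector to the set, tightening the envelope at the visited point.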
Subclasses of POSGs
Recent works have addressed particular cases of discounted partially observable stochastic games (POSGs), 2-player and zero-sum if not specified otherwise, exploiting the structure of the problem to turn it into an equivalent problem for which Bellman's principle applies. Ghosh et al. [12] considered POSGs with public actions and shared observations, which can be turned into stochastic games defined over the common belief space, similarly to POMDPs turned into belief MDPs. Chatterjee and Doyen [7], Basu and Stettner [2], and Horák et al. [15] considered One-Sided POSGs, i.e., scenarios where player 1 (w.l.o.g.) only partially observes the system state, while player 2 has access to the system state, plus the action and observation of player 1. Cole and Kocherlakota [8] considered POSGs with independent private states, partially shared observability, and each player's utility function depending on his private state and on the shared observation. Horák and Bošanský [14] considered zs-POSGs with independent private states and public observations, i.e., scenarios where (i) each player has a private state that he fully observes, and (ii) both players receive the same public observations of each player's private state. Any player's belief over the other player's private state is thus common knowledge.
Focusing on the work of Horák et al. [15, 14], in both cases convexity or concavity properties of the optimal value function are obtained, which allow deriving upper- and lower-bounding approximators. These approximators are then employed in HSVI-based algorithms. Yet, moving from MDPs and POMDPs (as in Smith's work [28]) to these settings induces a tree of possible futures with an infinite branching factor, which requires changes to the algorithm, and thus to the theoretical analysis of the finite-time convergence. As we shall see, the present work adopts similar changes.
Wiggers et al. [32] prove that, using appropriate representations, the value function associated with a zs-POSG is convex for (maximizing) player 1 and concave for (minimizing) player 2. Yet, this did not allow deriving a solver based on approximating the value function. Here, we exploit no convexity or concavity property of the optimal value function, as they may not hold, but its Lipschitz continuity.
Imperfect Information Games
Finite-horizon (general-sum) POSGs can be written as extensive-form games with imperfect information and perfect recall (EFGs, often referred to as imperfect-information games) [24], which makes solution techniques for EFGs relevant even for infinite-horizon POSGs. A first approach to solving EFGs is to turn them into a normal-form game before looking for a Nash equilibrium, thus ignoring the temporal aspect of the problem [27] and inducing a combinatorial explosion. For (2-player) zero-sum EFGs, this leads to solving two linear programs (one for each player). Koller and Megiddo [16] propose a different linear programming approach for zs-EFGs that exploits the temporal aspect through the choice of decision variables, but still does not apply Bellman's OP (see also [31, 17]). More recently, Counterfactual Regret Minimization (CFR) [34] has been introduced, allowing one to solve large imperfect-information games with bounded regret, such as heads-up no-limit hold'em poker, now winning against top human players [5]. While some CFR-based algorithms use heuristic-search techniques, and thus somehow exploit the sequentiality of the game, they do not rely on Bellman's OP either.
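As an illustration of the normal-form route for 2-player zero-sum games, the maximizing player's linear program can be sketched as follows (a toy example using SciPy; this is the basic normal-form LP, not the sequence-form LP of Koller and Megiddo):

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Maximin mixed strategy for the row (maximizing) player of payoff
    matrix A: maximize v subject to x^T A e_j >= v for every column j,
    with x in the probability simplex."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                 # linprog minimizes, so min -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # v - x^T A e_j <= 0, each j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum(x) == 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]    # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Matching pennies: value 0, uniform optimal strategy.
x, v = solve_zero_sum(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```

The minimizing player's strategy is obtained from the dual (or by solving the symmetric LP on the transposed, negated matrix).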
3 Background
For the sake of clarity, the concepts and results of the EFG literature used in this work will be recast in the POSG setting. We shall employ the terminology of pure/mixed/behavioral strategies and strategy profiles, more convenient in our non-collaborative setting, instead of deterministic or stochastic policies (private or joint ones), common in the collaborative setting of Dec-POMDPs.
A (2-player) zero-sum partially observable stochastic game (zs-POSG) is defined by a tuple $\langle \mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, \mathcal{Z}^1, \mathcal{Z}^2, P, r, H, \gamma, b_0 \rangle$, where

- $\mathcal{S}$ is a finite set of states;

- $\mathcal{A}^i$ is (player) $i$'s finite set of actions;

- $\mathcal{Z}^i$ is $i$'s finite set of observations;

- $P^{z^1,z^2}_{a^1,a^2}(s' \mid s)$ is the probability to transition to state $s'$ and receive observations $z^1$ and $z^2$ when actions $a^1$ and $a^2$ are performed in state $s$;

- $r : \mathcal{S} \times \mathcal{A}^1 \times \mathcal{A}^2 \to \mathbb{R}$ is a (scalar) reward function;

- $H \in \mathbb{N} \cup \{\infty\}$ is a temporal horizon;

- $\gamma \in [0, 1)$ is a discount factor; and

- $b_0$ is the (public/common) initial belief state.
Player 1 would like to maximize the expected return, defined as the discounted sum of future rewards, while player 2 would like to minimize it, as we formalize next.
From the Dec-POMDP, POSG and EFG literature, we use the following concepts and definitions, where $i \in \{1, 2\}$ denotes a player:

- $\neg i$ is $i$'s opponent.

- $\theta^i_t = (a^i_0, z^i_1, \dots, a^i_{t-1}, z^i_t)$ is a length-$t$ action-observation history for $i$. The set of histories is $\Theta^i = \bigcup_t \Theta^i_t$, with one subset per time step.

- $\theta_t = (\theta^1_t, \theta^2_t)$ is a joint history at time $t$. The set of joint histories is $\Theta = \bigcup_t \Theta_t$, with one subset per time step.
- An occupancy state $\sigma_t$ at time $t$ is a probability distribution over state–joint-history pairs $(s, \theta_t)$. ($\sigma_t$ is completely specified by $b_0$ and the strategy profile applied up to $t$.) The set of occupancy states is $\mathcal{O}^\sigma = \bigcup_t \mathcal{O}^\sigma_t$, with one subset per time step. Note that this notion applies to POSGs despite the use of stochastic actions.

- A pure strategy $\pi^i_{t:t'}$ for $i$ is a mapping from private histories in $\Theta^i_{t:t'}$ (with $t \leq t'$) to single private actions. By default, $\pi^i \doteq \pi^i_{0:H-1}$.

- $\pi_{t:t'} = (\pi^1_{t:t'}, \pi^2_{t:t'})$ is a pure strategy profile.
- A mixed strategy $\mu^i$ for $i$ is a probability distribution over pure strategies. It is used by first sampling one of the pure strategies (at $t = 0$), and then executing that strategy until $t = H - 1$.

- $\mu = (\mu^1, \mu^2)$ is a mixed strategy profile.
- A (behavioral) decision rule $\beta^i_t$ at time $t$ for $i$ is a mapping from private histories in $\Theta^i_t$ to distributions over private actions. For convenience, we will note $\beta^i_t(\theta^i_t, a^i)$ the probability to pick action $a^i$ when facing history $\theta^i_t$.

- $\beta_t = (\beta^1_t, \beta^2_t)$ is a decision rule profile (noting $\beta_t(\theta_t, a) \doteq \beta^1_t(\theta^1_t, a^1) \cdot \beta^2_t(\theta^2_t, a^2)$).

- $\beta^i_{t:t'} = (\beta^i_t, \dots, \beta^i_{t'})$ is a behavioral strategy for $i$ from time step $t$ to $t'$ (included). By default, $\beta^i \doteq \beta^i_{0:H-1}$.

- $\beta_{t:t'} = (\beta^1_{t:t'}, \beta^2_{t:t'})$ is a behavioral strategy profile.
- The value of a behavioral strategy profile $\beta_{t:H-1}$ in occupancy state $\sigma_t$ (from time step $t$ on) is
$$V(\sigma_t, \beta_{t:H-1}) \doteq E\Big[ \sum_{\tau=t}^{H-1} \gamma^{\tau - t} R_\tau \,\Big|\, \sigma_t, \beta_{t:H-1} \Big],$$
where $R_\tau$ is the random variable associated to the instant reward at time step $\tau$. [Note: This definition extends naturally to pure and mixed strategy profiles.]
The primary objective is here to find a Nash equilibrium strategy (NES), i.e., a mixed strategy profile $\mu = (\mu^1, \mu^2)$ such that no player has an incentive to deviate, which can be written:
$$V(\sigma_0, \mu^1, \mu^2) \geq V(\sigma_0, \tilde{\mu}^1, \mu^2) \;\; \forall \tilde{\mu}^1, \quad \text{and} \quad V(\sigma_0, \mu^1, \mu^2) \leq V(\sigma_0, \mu^1, \tilde{\mu}^2) \;\; \forall \tilde{\mu}^2.$$
In such a 2-player zero-sum game, all NESs have the same Nash-equilibrium value (NEV) $V^*(\sigma_0)$.
Finite-horizon POSGs being equivalent to EFGs with imperfect information and perfect recall, a key result for EFGs, namely Kuhn's theorem on the equivalence of mixed and behavioral strategies under perfect recall, applies to (finite-$H$) POSGs.
4 Solving POSGs as Occupancy MGs
In this section, unless stated otherwise, we assume finite horizons and exact solutions (no error).
Here, we show (i) how a zsPOSG can be reformulated as a different zerosum Markov game, and (ii) that Bellman’s optimality principle applies in this game.
4.1 From zs-POSGs to zs-OMGs
To solve a zs-POSG, we take the viewpoint of a central planner that searches offline for the best behavioral strategy profile before providing it to the players. This contrasts with Dec-POMDPs, where deterministic strategy profiles suffice, and means exploring a (bounded) continuous space rather than a (finite) discrete one as for Dec-POMDPs. Such a planner grows a partial strategy profile $\beta_{0:t-1}$ by appending a decision rule profile $\beta_t$.
Note that any partial strategy profile $\beta_{0:t-1}$ is in one-to-one correspondence with an occupancy state $\sigma_t$. So, the controlled process induced in occupancy space, where actions are decision rule profiles, is both deterministic and Markovian (see formal details about the dynamics below): applying $\beta_t$ in $\sigma_t$ (i.e., appending it to $\beta_{0:t-1}$) leads to a unique $\sigma_{t+1}$. Also, the expected reward at time $t$ is linear in occupancy space (more precisely in the corresponding distribution over states). All this allows reasoning not on partial behavioral strategy profiles, but on occupancy states. The central planner will thus (i) infer occupancy states seen as "beliefs" over the possible situations ("situation" here meaning the current state $s$ and the players' joint action-observation history $\theta_t$) which may have been reached, although without knowing what actually happened, and (ii) map each occupancy state to a decision rule profile telling the players how to act depending on their actual action-observation histories. (In contrast, in a POMDP, the belief state depends on the agent's action-observation history, and is mapped to a single action.) Each zs-POSG is thus turned into an equivalent game, called a zero-sum occupancy Markov game (zs-OMG). (We use (i) "Markov game" instead of "stochastic game" because the dynamics are not stochastic, and (ii) "partially observable stochastic game" to stick with the literature.) A zs-OMG is formally defined by the tuple $\langle \mathcal{O}^\sigma, \mathcal{D}, r, H, \gamma, \sigma_0 \rangle$, where:

- $\mathcal{O}^\sigma$ is the set of occupancy states induced by the zs-POSG;

- $\mathcal{D} = \mathcal{D}^1 \times \mathcal{D}^2$ is the set of decision rule profiles of the zs-POSG;

- $r(\sigma_t, \beta_t) \doteq \sum_{s, \theta_t} \sigma_t(s, \theta_t) \sum_{a^1, a^2} \beta_t(\theta_t, \langle a^1, a^2 \rangle)\, r(s, a^1, a^2)$ is a reward function naturally induced from the zs-POSG as the expected reward for the current occupancy state and decision rule profile (we use the same notation $r$ as for zs-POSGs, as the context shall indicate which one is discussed);

- $H$, $\gamma$, and $\sigma_0 \doteq b_0$ are as in the zs-POSG.
Note first that, for convenience, we directly consider behavioral decision rules, which correspond to mixed strategies. Of course, at time $t$, player $i$'s possible actions should be decision rules defined over histories that have non-zero probability in the current $\sigma_t$. The dynamics being deterministic and the actions public, both players of that new game (also denoted 1 and 2, although these are different players) know the next state after each transition. But this is not a standard zero-sum Markov game either, since (i) the mixture of two actions is equivalent to another action already in the (continuous) action space at hand, and (ii) at each time step, the (occupancy) state space is continuous.
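The deterministic occupancy dynamics described above can be sketched as follows (a toy implementation with hypothetical data structures, not the paper's notation):

```python
from collections import defaultdict

def next_occupancy(sigma, beta1, beta2, P):
    """Deterministic occupancy transition of a zs-OMG (toy sketch).

    sigma: dict mapping (state, (h1, h2)) -> probability, where h1/h2 are
           private action-observation histories (tuples of (action, obs) pairs)
    beta1/beta2: decision rules, private history -> {action: probability}
    P: dynamics, (state, a1, a2) -> {(next_state, z1, z2): probability}
    """
    nxt = defaultdict(float)
    for (s, (h1, h2)), p in sigma.items():
        for a1, p1 in beta1(h1).items():
            for a2, p2 in beta2(h2).items():
                for (s2, z1, z2), pt in P[(s, a1, a2)].items():
                    nh = (h1 + ((a1, z1),), h2 + ((a2, z2),))
                    nxt[(s2, nh)] += p * p1 * p2 * pt
    return dict(nxt)
```

Since the next occupancy state is a deterministic function of the current one and the decision rule profile, this is exactly the Markovian, deterministic transition the central planner reasons with.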
We shall study the subgames of a zs-OMG, i.e., situations where some occupancy state $\sigma_t$ has somehow been reached at time step $t$, and the central solver is looking for rational strategies ($\beta^1_{t:H-1}$ and $\beta^2_{t:H-1}$) to provide to the players. $\sigma_t$ tells which action-observation histories each player could be facing with non-zero probability, and thus which ones are relevant for planning. We can then extend the definition of the value function from time step 0 only to any time step $t$ (using behavioral strategies). Note that $\sigma_t$ is in one-to-one relationship with a strategy profile $\beta_{0:t-1}$, so that we can denote the concatenation of that profile with a suffix strategy profile $\beta_{t:H-1}$.
4.2 Back to Mixed Strategies
We now reintroduce, and generalize, mixed strategies as a mathematical tool to handle subgames of a zs-OMG as normal-form games, and give some preliminary results.
For a given $\sigma_t$, let $\mu_{0:t-1}$ be an arbitrarily chosen mixed strategy profile that leads to (/is compatible with) $\sigma_t$, thus defined over time interval $0..t-1$. To complete this mixed (prefix) strategy, the central planner should provide each player $i$ with a different (suffix) strategy to execute for each $\theta^i_t$ it could be facing. We now detail how to build an equivalent set of mixed (full) strategies for $i$. Each of the pure (prefix) strategies used in $\mu^i_{0:t-1}$ (belonging to a set denoted $\Pi^i_{0:t-1}$) can be extended by appending a different pure (suffix) strategy at each of its leaf nodes, which leads to a large set of pure strategies $\Pi^i_{0:H-1}$. Then, let $M^i_{\sigma_t}$ be the set of mixed (full) strategies obtained by considering the distributions $\mu^i$ over $\Pi^i_{0:H-1}$ that verify, $\forall \pi^i_{0:t-1} \in \Pi^i_{0:t-1}$,
$$\sum_{\pi^i \text{ extending } \pi^i_{0:t-1}} \mu^i(\pi^i) = \mu^i_{0:t-1}(\pi^i_{0:t-1}). \qquad (1)$$
This is the set of mixed strategies compatible with $\sigma_t$.
Lemma (proof in App. A.2). $M^i_{\sigma_t}$ is convex and equivalent to the set of behavioral strategies of $i$ from $t$ onwards, thus sufficient to search for a Nash equilibrium in the subgame at $\sigma_t$.
While only future rewards are relevant when making a decision at $t$, reasoning with mixed strategies defined from $t = 0$ will be convenient because the value $V(\sigma_0, \mu^1, \mu^2)$ is linear in each $\mu^i$, which allows coming back to a standard normal-form game and applying known results.
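The mixed-to-behavioral correspondence underlying this equivalence can be illustrated with a small sketch (a hypothetical representation of pure strategies as history-to-action dictionaries, not the paper's formalism):

```python
def behavioral_from_mixed(mu, history):
    """Behavioral decision rule induced at a private history by a mixed
    strategy (Kuhn's-theorem-style conversion, toy sketch).

    mu: list of (pure_strategy, probability) pairs, where a pure strategy
        is a dict mapping histories (tuples of (action, obs) pairs) to actions
    history: the private history at which the rule is queried
    """
    def consistent(pi):
        # pi must prescribe exactly the actions taken along the history
        return all(pi[history[:k]] == history[k][0] for k in range(len(history)))
    support = [(pi, p) for pi, p in mu if consistent(pi)]
    total = sum(p for _, p in support)
    dist = {}
    for pi, p in support:
        a = pi[history]
        dist[a] = dist.get(a, 0.0) + p / total  # condition on reaching history
    return dist
```

Under perfect recall, the behavioral strategy built this way achieves the same distribution over play as the original mixed strategy.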
In the remainder, we simply note $\mu^i$ (without time indexes) the mixed strategies in $M^i_{\sigma_t}$. Also, since we shall work with local games, let us define $M^i_{\sigma_t, \beta^{\neg i}_t}$ as the set of $i$'s mixed strategies compatible with occupancy states reachable given $\sigma_t$ and $\beta^{\neg i}_t$ (with either $i = 1$ or $i = 2$). Then, $M^i_{\sigma_t} \subseteq M^i_{\sigma_t, \beta^{\neg i}_t}$ (the inclusion being due to the latter sets being less constrained in their definition). As a consequence, when maximizing some function $f$ over $i$'s mixed strategies compatible with a given $\sigma_t$:
$$\max_{\mu^i \in M^i_{\sigma_t}} f(\mu^i) \;\leq\; \max_{\mu^i \in M^i_{\sigma_t, \beta^{\neg i}_t}} f(\mu^i).$$
As can easily be demonstrated (cf. Lemma 2 in App. A.3), any Nash equilibrium solution of our original game induces a Nash equilibrium in any of its reachable subgames. (In contrast, a subgame-perfect equilibrium requires a Nash equilibrium in any subgame reachable by some strategy profile, which is more constraining.) But this does not tell whether Bellman's optimality principle applies, which we discuss next.
4.3 Bellman’s Optimality Principle
For any $t$ and $\sigma_t$, let us define (i) a NE profile $\beta^*_{t:H-1}$ for the subgame at $\sigma_t$, (ii) the NE value $V^*_\tau(\sigma_\tau)$ of the subgame at any reached $\sigma_\tau$, and (iii) the local subgame at $\sigma_t$, whose payoff function is $Q^*_t(\sigma_t, \beta_t) \doteq r(\sigma_t, \beta_t) + \gamma V^*_{t+1}(T(\sigma_t, \beta_t))$, where $T$ denotes the deterministic occupancy transition. Then, given Nash equilibrium solutions for any $\sigma_{t+1}$, the applicability of Bellman's optimality principle shall be proved if a Nash equilibrium of the subgame at $\sigma_t$ can be found by (i) solving the local subgame to get a decision rule profile $\beta^*_t$, and (ii) appending to it the NE strategies of the reached $\sigma_{t+1}$.
An Abnormal-Form Game?
A first question is whether this local game is in fact a normal-form game, i.e., whether it could be defined by a payoff matrix over pure decision rules, payoffs for behavioral decision rules being obtained through linear mixtures.
The payoff function is linear in each player's decision-rule space at each time step (i.e., in $\beta^i_\tau$ for any $i$ and $\tau$), but multilinear in each player's behavioral-strategy space (see Lemma 1 in App. A.4.1), which suggests that it may not be convex-concave (and thus not (bi)linear) in the space of decision rules at time $t$. As a consequence, we are possibly facing an abnormal-form game and cannot use von Neumann's minimax theorem.
Properties of the Maximin and Minimax Values
Rather than digging into the convexity-concavity property further, we now show that computing the maximin and minimax values of the local game induces finding a NE of the subgame at $\sigma_t$, given NEs for any $\sigma_{t+1}$.
Theorem (proof in App. A.4.2). In the 2-player zero-sum abnormal-form local game, the maximin and minimax values are both equal to $V^*_t(\sigma_t)$, i.e., as previously defined, the NEV of the subgame at $\sigma_t$, and correspond to a NES.
Proof.
(sketch) The proof relies on first developing the maximin value of the local game, then using (i) the equivalence of maximin and minimax for mixed strategies (when von Neumann's minimax theorem applies), and (ii) the equivalence of mixed and behavioral strategies. ∎
Maximin and Minimax Computation
The last results tell us that we can exploit knowledge of the optimal value function at $t+1$ (for all $\sigma_{t+1}$) to find optimal decision rules at $t$ for any given $\sigma_t$ by computing the maximin and minimax values of the local (abnormal-form) game at hand. Yet, we cannot use an LP as for normal-form games. To find an appropriate solution method, let us now look at properties of this game, noting that we lack any convexity/concavity property, and presenting a preliminary result.
Lemma (proof in App. A.4.3). At depth $t$, the reward $r(\sigma_t, \beta_t)$ is linear in $\sigma_t$, $\beta^1_t$, and $\beta^2_t$. It is more precisely Lipschitz-continuous in $\sigma_t$ (in 1-norm), i.e., there is a constant $\lambda$ such that, for any $\sigma_t$, $\sigma'_t$: $|r(\sigma_t, \beta_t) - r(\sigma'_t, \beta_t)| \leq \lambda \, \|\sigma_t - \sigma'_t\|_1$.
The Lipschitz-continuity (LC) property would also hold in 2-norm or max-norm, due to the equivalence between norms, but with different constants.
Lemma (proof in App. A.4.3). For any $t$ and $\sigma_t$, the payoff function $Q^*_t(\sigma_t, \beta^1_t, \beta^2_t)$ is Lipschitz continuous in both $\beta^1_t$ and $\beta^2_t$.
The payoff function of our game is thus LC in each private decision-rule space, which suggests using error-bounded global optimization techniques, such as Munos's DOO (Deterministic Optimistic Optimization) [23]. Here, searching for a maximin (resp. minimax) value suggests using two nested optimization processes: an "outer" one for the $\max$ (resp. $\min$) operator, and an "inner" one for the $\min$ (resp. $\max$). To ensure being within $\epsilon$ of the maximin value, each process could, for example, use an $\epsilon/2$ tolerance threshold. Yet, in such a nested optimization scheme, the inner process may stop, at each call, before reaching optimality if this leads the outer process to explore a different point.
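As a simple stand-in for such nested error-bounded optimization, the following sketch bounds the maximin of an $L$-Lipschitz payoff on $[0,1]^2$ with a uniform grid (a much cruder scheme than DOO, used here only to illustrate how a Lipschitz constant yields an error guarantee):

```python
import numpy as np

def lipschitz_maximin(f, L, eps):
    """eps-approximate maximin of f(x, y) on [0, 1]^2, assuming f is
    L-Lipschitz in each argument (toy stand-in for nested optimistic
    optimization).

    With grid step h, the nearest grid point to any x is within h/2, so
    each of the two nested optimizations incurs error <= L*h/2; choosing
    h <= eps/L bounds the overall error by eps.
    """
    n = int(np.ceil(L / eps)) + 1          # grid step = 1/(n-1) <= eps/L
    grid = np.linspace(0.0, 1.0, n)
    # inner: min over y for each candidate x; outer: max over x
    return max(min(f(x, y) for y in grid) for x in grid)

# f(x, y) = -(x - y)^2 has maximin value -0.25 (attained at x = 1/2).
approx = lipschitz_maximin(lambda x, y: -(x - y) ** 2, L=2.0, eps=0.01)
```

DOO would replace the uniform grid with an adaptive partition, expanding only cells whose optimistic bound (center value plus L times radius) remains competitive.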
Due to the continuous state space of zs-OMGs, $V^*$ cannot be computed exactly. We shall now see how to approximate it, before exploiting the resulting approximators in a specific version of HSVI in Sec. 6.
5 Properties of $V^*$
In this section, we again assume finite-horizon problems (unless stated otherwise). The main objective here is to propose upper- and lower-bounding approximators that exploit $V^*$'s Lipschitz continuity (rather than a PWLC property), as Fehr et al. [10] did in the setting of (single-agent) information-oriented control, but here with simpler derivations.
5.1 Finite-Horizon Lipschitz Continuity of $V^*$
The following lemma proves that the expected instant reward at any $t$ is linear in $\sigma_t$, and thus so is the expected value of a finite-horizon strategy profile from $t$ onwards (trivial proof by induction).
Lemma (proof in App. A.5.1). At depth $t$, $V(\sigma_t, \beta_{t:H-1})$ is linear w.r.t. $\sigma_t$.
Corollary (proof in App. A.5.1). $V^*$ is Lipschitz continuous in $\sigma_t$ at any depth $t$.
Refining the Lipschitz constant(s)
We have just discussed the LC of $V^*$ based on the LC of finite-horizon strategies, reasoning on worst-case Lipschitz constants (one per time step) that hold for all strategies. Now, (i) could we refine those constants based on knowledge regarding $V^*$, in particular upper and lower bounds $\overline{V}$ and $\underline{V}$ (see next sections)? And (ii) could we make use of those refined constants in the planning process?
Regarding question (i), $\overline{V}$ and $\underline{V}$ tell us that any strategy profile from time $t$ on (and thus with remaining horizon $H - t$) has values within the corresponding bounds, hence a refined Lipschitz constant derived from this value range.
Regarding question (ii), as $\overline{V}$ and $\underline{V}$ are refined during the planning process, these refined depth-dependent constants would progressively shrink, thus speeding up planning! This phenomenon could encourage improving the value function bounds where they seem high (for $\overline{V}$) or low (for $\underline{V}$).
5.2 Approximating $V^*$
Note: For the sake of readability, the depth index may be omitted when it can be inferred from the occupancy state.
Approximators
An HSVI-like algorithm requires maintaining both an upper and a lower approximator of $V^*$. We denote them $\overline{V}$ and $\underline{V}$.
The LC of $V^*$ suggests employing LC function approximators at each depth $t$ in the form of a lower envelope of (i) an initial upper bound $\overline{V}^{init}_t$ and (ii) downward-pointing L1-cones, where an upper-bounding cone, located at $\sigma$, with "summit" value $v$, and slope $\lambda$, induces a function $\sigma' \mapsto v + \lambda \|\sigma' - \sigma\|_1$. The upper bound is thus defined as the lower envelope of $\overline{V}^{init}_t$ and the set of cones, i.e., the pointwise minimum of these functions.
Respectively, for the lower-bounding approximator at depth $t$: a lower-bounding (upward-pointing) cone located at $\sigma$ with summit value $v$ induces a function $\sigma' \mapsto v - \lambda \|\sigma' - \sigma\|_1$; and the lower bound is defined as the upper envelope of an initial lower bound $\underline{V}^{init}_t$ and the set of cones, i.e., their pointwise maximum.
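The cone-based upper bound can be sketched as follows (a toy class; `lam` stands for the Lipschitz constant, and occupancy states are represented as plain vectors):

```python
import numpy as np

class LipschitzUpperBound:
    """Upper bound on V* as the lower envelope of an initial bound and
    downward-pointing L1-cones (toy sketch, not the paper's implementation)."""

    def __init__(self, init_value, lam):
        self.init = init_value   # constant initial upper bound
        self.lam = lam           # Lipschitz constant (cone slope)
        self.cones = []          # list of (occupancy point, summit value)

    def update(self, sigma, value):
        """Point-based update: add a cone with summit `value` at `sigma`."""
        self.cones.append((np.asarray(sigma, dtype=float), value))

    def __call__(self, sigma):
        sigma = np.asarray(sigma, dtype=float)
        vals = [self.init] + [v + self.lam * np.abs(sigma - c).sum()
                              for c, v in self.cones]
        return min(vals)

ub = LipschitzUpperBound(init_value=10.0, lam=1.0)
ub.update([1.0, 0.0], 2.0)   # cone at a visited occupancy state
```

The lower bound is symmetric: upward-pointing cones $v - \lambda\|\sigma' - \sigma\|_1$, combined by a pointwise maximum with the initial lower bound.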
(Point-Based) Operator and Value Updates
One cannot apply an operator (noted $\mathcal{H}$) to update a value function approximator uniformly over the continuous occupancy space. Instead, when visiting some occupancy state $\sigma_t$ (at depth $t$), we perform a point-based update of the upper bound by (i) finding the NEV of the corresponding local game (which relies on $\overline{V}_{t+1}$ at $t+1$), then (ii) adding a downward-pointing cone to $\overline{V}_t$. We note $\overline{V}'$ the upper bound after this update at point $\sigma_t$. The same applies to $\underline{V}$ with upward-pointing cones instead, using notation $\underline{V}'$.
5.3 Initializations
Due to the symmetry between players in a zs-POSG, without loss of generality let us look for an upper bound of the optimal value function $V^*$, i.e., an optimistic bound (an admissible heuristic) for (maximizing) player 1. A usual approach to looking for optimistic bounds is to relax the problem for the player at hand. To that end, one can here envision manipulating the players' knowledge, their control over the system, the action ordering, or the opponent's objective, e.g.:

providing more (e.g. full) state observability to 1;

providing less (e.g. no) state observability to 2;

letting 1 know what 2 observes;

letting 1 control chance (2’s choice would then only restrict the set of reachable states), but this would require that 1 has full observability;

letting 2 act first, and telling 1 about 2’s selected action (exactly or through a partial observation);

turning 2 into a collaborator by making him maximize, rather than minimize, the expected return.
Accounting for related Markov models for sequential decisionmaking, this suggests turning the zsPOSG at hand for example into:

a Dec-POMDP by turning the opponent into a collaborator (or even into a POMDP or an MDP); or

a One-Sided POSG [15] by combining (i) full state observability, (ii) observability of 2's observation, and (iii) observability of 2's action.
Note that making both players' actions or observations public (as in PO-POSGs [14]) would not be a viable solution, as this would imply providing more knowledge to both players at the same time, which may prevent the resulting optimal value function from being an upper bound for our problem.
6 HSVI for zs-POSGs
In this section, we assume infinite-horizon problems (unless stated otherwise) and $\epsilon$-optimal solutions.
6.1 Algorithm
As we shall see, optimally solving an infinite-horizon zs-POSG amounts, as often, to solving a problem with a finite maximum horizon, which allows exploiting the results derived up to now. For convenience, we assume this maximum horizon already known and use horizon-dependent constants (e.g., Lipschitz constants).
HSVI for zs-OMGs is detailed in Algorithm 1. As vanilla HSVI, it relies on (i) generating trajectories while acting optimistically, i.e., player 1 (resp. 2) acting "greedily" w.r.t. $\overline{V}$ (resp. $\underline{V}$), and (ii) locally updating the upper- and lower-bounding approximators. Here, computations of value updates and strategies rely on solving our local zero-sum abnormal-form games (possibly through a maximin/minimax optimization exploiting the Lipschitz continuity, as discussed in Sec. 4.3). A key difference lies in the criterion for stopping trajectories. In vanilla HSVI (for POMDPs), the finite branching factor allows looking at the convergence of $\overline{V}$ and $\underline{V}$ at each point reachable under an optimal strategy. To ensure convergence at the initial belief, trajectories just need to be interrupted when the current width $\overline{V} - \underline{V}$ at the visited point is smaller than a threshold $thr(t)$. Here, dealing with an infinite branching factor, one may converge towards an optimal solution while always visiting new points of the occupancy space. Yet, as the sequence of generated (deterministic) trajectories converges to an optimal trajectory, the density of visited points around it increases, so that the Lipschitz approximation error tends to zero. One can thus bound the width within balls around visited points by exploiting the Lipschitz continuity of the optimal value function. As proposed by Horák et al. [15], this is achieved by adding a term to ensure that the width is below $\epsilon$ within a ball of radius $\rho$ around the current point (here the occupancy state $\sigma_t$). Hence the threshold
(2) 
Setting $\rho$
As can be observed, this threshold function should always return positive values, which requires a small enough $\rho$. For a given problem, the maximum possible value of $\rho$ shall depend on the Lipschitz constants at each time step, which themselves depend on the upper and lower bounds of the optimal value function (and thus may evolve during the planning process). For the sake of simplicity, let us consider a single Lipschitz constant $\lambda$ common to all time steps, which always exists.
Lemma (proof in App. A.6). Assuming a single depth-independent Lipschitz constant $\lambda$, and noting that
(3) 
one can ensure positivity of the threshold at any $t$ by enforcing a small enough $\rho < \rho_{max}$.
We shall thus pick $\rho$ in $(0, \rho_{max})$. But what is the effect of setting $\rho$ to small or large values?
- The smaller $\rho$, the larger $thr(t)$, the shorter the trajectories, but the smaller the balls and the higher the required density of points around the optimal trajectory, thus the more trajectories needed to converge.

- The larger $\rho$, the smaller $thr(t)$, the longer the trajectories, but the larger the balls and the lower the required density of points around the optimal trajectory, thus the fewer trajectories needed to converge.

So, setting $\rho$ means making a compromise between the number of generated trajectories and their length.
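The overall trajectory generation with its threshold-based stopping can be sketched structurally as follows (this is not Algorithm 1 itself: every problem-specific operation, including the threshold of Eq. (2), is passed in as a callable):

```python
def hsvi_explore(sigma, t, thr, width, greedy_pair, update, next_state):
    """Skeleton of one optimistic HSVI trajectory (structural sketch only).

    thr(t):              trajectory-stopping threshold
    width(sigma, t):     current gap between upper and lower bounds at sigma
    greedy_pair(sigma, t): decision-rule profile chosen optimistically
    update(sigma, t):    point-based update of both bound approximators
    next_state(sigma, beta): deterministic occupancy transition
    """
    if width(sigma, t) <= thr(t):
        return                                   # trajectory interrupted
    beta = greedy_pair(sigma, t)
    hsvi_explore(next_state(sigma, beta), t + 1,
                 thr, width, greedy_pair, update, next_state)
    update(sigma, t)                             # backward update on the way up
```

Note the backward ordering: updates happen after the recursive call returns, so deeper occupancy states are refined first, as in vanilla HSVI.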
6.2 Finite-Time Convergence
First, the following result bounds the length of HSVI's trajectories using the bounded width $\overline{V} - \underline{V}$ and the growth of $thr(t)$ with $t$.
Lemma (proof in App. A.6). Assuming a depth-independent Lipschitz constant $\lambda$, the length of trajectories is upper-bounded by a finite $T_{max}$.
Note that (i) the classical upper bound is retrieved when $\rho \to 0$ (Eq. (6.7) in [28]), and (ii) this gives us the maximum horizon needed to solve the problem. Now, knowing that any trial terminates in bounded time allows deriving the following results, in order.
Theorem (proof in App. A.6). Consider a trial of length $T$, and consider that the backward updates of $\overline{V}$ and $\underline{V}$ have not yet been performed. Then the width at the last visited point is below the corresponding threshold, and a related width bound holds for every occupancy state within distance $\rho$ of that point (see App. A.6 for the exact statements).
Theorem 2.
Algorithm 1 terminates with an $\epsilon$-approximation of $V^*(\sigma_0)$.
Proof.
(Adapted from [14]) Assume, for the sake of contradiction, that the algorithm does not terminate and generates an infinite number of explore trials. Since the length of a trial is bounded by a finite number $T_{max}$, the number of trials of some given length $T$ must be infinite. It is impossible to fit an infinite number of occupancy points whose pairwise distances all exceed $\rho$ within the (bounded) occupancy simplex. Hence there must be two trials of length $T$ whose last visited points $\sigma_T$ and $\sigma'_T$ satisfy $\|\sigma_T - \sigma'_T\|_1 \leq \rho$. Without loss of generality, assume that $\sigma_T$ was visited first. According to the previous theorem, the point-based update in $\sigma_T$ resulted in the width at $\sigma'_T$ being below the threshold, which contradicts that the stopping condition of Algorithm 1 had not been satisfied for $\sigma'_T$ (and hence that the second trial was a trial of length $T$). ∎
Note that the number of trials could be (tediously) upper-bounded by determining how many balls of radius $\rho$ are required to cover the occupancy simplexes at each depth.
7 Discussion
Inspired by techniques solving POMDPs as belief MDPs or Dec-POMDPs as occupancy MDPs, we have demonstrated that zs-POSGs can be turned into a new type of sequential game, namely zs-OMGs, allowing the application of Bellman's optimality principle. Value function approximators (with heuristic initializations) can be used thanks to the Lipschitz continuity of $V^*$, and despite $V^*$ possibly not being concave or convex in any relevant statistic. A variant of HSVI has been derived which provably converges in finite time to an optimal solution.
This approach was motivated by the fact that the corresponding techniques for POMDPs and Dec-POMDPs provide state-of-the-art solvers. The time complexity of the algorithm shall depend, among other things, on that of the maximin/minimax optimization technique in use, and on how many trials are required before convergence. We also currently lack empirical comparisons of the resulting algorithm with existing zs-POSG solution techniques.
Several implementation details could be further discussed, such as the maximin/minimax error-bounded optimization algorithm, the need to regularly prune dominated cones in $\overline{V}$ and $\underline{V}$, and the possible use of compression techniques to reduce the dimensionality of the occupancy subspaces, as in FB-HSVI [9].
Regarding execution, as in single-agent or collaborative multi-agent settings, while exploration is guided by optimistic decisions (greediness w.r.t. $\overline{V}$ for 1 and $\underline{V}$ for 2), actual decisions should be pessimistic, i.e., 1 should act "greedily" w.r.t. $\underline{V}$, and 2 w.r.t. $\overline{V}$.
Handling finite-horizon settings requires few changes. The maximum length of trials shall be the minimum between this horizon and the bound derived above. Additionally, considering $\gamma = 1$ shall require revising the Lipschitz constants and some other formulas.
As often with Dec-POMDPs [30, 9], each player's strategy is here history-dependent, because no private belief states could be derived in general, though this is feasible under certain assumptions [15, 14]. One could possibly address this issue as MacDermed and Isbell [20] did, by assuming that a bounded number of beliefs is sufficient to solve the problem.
Public actions and observations, as in Poker, could be exploited by turning the non-observable sequential decision problem faced by the central planner into a partially observable one, and thus the deterministic OMG into a probabilistic one.
References

[1] N. Basilico, G. De Nittis, and N. Gatti. A security game combining patrolling and alarm-triggered responses under spatial and detection uncertainties. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[2] A. Basu and L. Stettner. Finite- and infinite-horizon Shapley games with non-symmetric partial observation. SIAM Journal on Control and Optimization, 53(6):3584–3619, 2015.
[3] R. Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38:716–719, 1952.
[4] D. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
[5] N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
[6] O. Buffet, J. Dibangoye, A. Saffidine, and V. Thomas. Heuristic search value iteration for zero-sum stochastic games. IEEE Transactions on Games, 2020.
[7] K. Chatterjee and L. Doyen. Partial-observation stochastic games: How to win when belief fails. ACM Transactions on Computational Logic, 15(2):16, 2014.
[8] H. L. Cole and N. Kocherlakota. Dynamic games with hidden actions and hidden states. Journal of Economic Theory, 98(1):114–126, 2001.
[9] J. Dibangoye, C. Amato, O. Buffet, and F. Charpillet. Optimally solving Dec-POMDPs as continuous-state MDPs. Journal of Artificial Intelligence Research, 55:443–497, 2016.
[10] M. Fehr, O. Buffet, V. Thomas, and J. Dibangoye. ρ-POMDPs have Lipschitz-continuous ε-optimal value functions. In Advances in Neural Information Processing Systems 31, pages 6933–6943, 2018.
[11] D. Fudenberg and J. Tirole. Game Theory. The MIT Press, 1991.
[12] M. K. Ghosh, D. McDonald, and S. Sinha. Zero-sum stochastic games with partial information. Journal of Optimization Theory and Applications, 121(1):99–118, Apr. 2004.
[13] E. A. Hansen, D. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, 2004.
[14] K. Horák and B. Bošanský. Solving partially observable stochastic games with public observations. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, pages 2029–2036, 2019.
[15] K. Horák, B. Bošanský, and M. Pěchouček. Heuristic search value iteration for one-sided partially observable stochastic games. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 558–564, 2017.
[16] D. Koller and N. Megiddo. The complexity of two-person zero-sum games in extensive form. Games and Economic Behavior, 4(4):528–552, 1992.
[17] D. Koller, N. Megiddo, and B. von Stengel. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2):247–259, 1996.
[18] H. W. Kuhn. Simplified two-person Poker. In H. W. Kuhn and A. W. Tucker, editors, Contributions to the Theory of Games, volume 1. Princeton University Press, 1950.
[19] H. W. Kuhn. Extensive games and the problem of information. In Contributions to the Theory of Games II, volume 28 of Annals of Mathematics Studies, pages 193–216. Princeton University Press, 1953.
[20] L. C. MacDermed and C. Isbell. Point based value iteration with optimal belief compression for Dec-POMDPs. In Advances in Neural Information Processing Systems 26, 2013.
[21] E. Machuca. An analysis of multiobjective search algorithms and heuristics. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[22] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
[23] R. Munos. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–130, 2014.
[24] F. Oliehoek and N. Vlassis. Dec-POMDPs and extensive form games: equivalence of models and algorithms. Technical Report IAS-UVA-06-02, Intelligent Systems Laboratory Amsterdam, University of Amsterdam, 2006.
[25] K. Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965.
[26] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
[27] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.
[28] T. Smith. Probabilistic Planning for Robotic Exploration. PhD thesis, The Robotics Institute, Carnegie Mellon University, 2007.
[29] B. S. Stewart and C. C. White, III. Multiobjective A*. Journal of the ACM, 38(4):775–814, Oct. 1991.
[30] D. Szer, F. Charpillet, and S. Zilberstein. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 576–583, 2005.
[31] B. von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.
[32] A. Wiggers, F. Oliehoek, and D. Roijers. Structure in the value function of two-player zero-sum games of incomplete information. In Proceedings of the Twenty-Second European Conference on Artificial Intelligence, pages 1628–1629, 2016.
[33] A. Wiggers, F. Oliehoek, and D. Roijers. Structure in the value function of two-player zero-sum games of incomplete information. Computing Research Repository, abs/1606.06888, 2016.
[34] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20, 2007.
Appendix A Appendix
This appendix mainly provides proofs of several theoretical claims of the paper.
A.1 From zs-POSGs to zs-OMGs
The following result shows that the occupancy state is Markovian, i.e., its value at time $\tau$ depends only on its previous value $\sigma_{\tau-1}$, the system dynamics, and the last behavioral decision rules $\beta^1_{\tau-1}$ and $\beta^2_{\tau-1}$.
Lemma 1.
Given an occupancy state $\sigma_{\tau-1}$ and a behavioral decision rule profile $(\beta^1_{\tau-1}, \beta^2_{\tau-1})$, the next occupancy state $\sigma_\tau$ is given by the following formula (for any $s'$, $\theta^1$, $\theta^2$, $a^1$, $a^2$, $z^1$, $z^2$):
$$\sigma_\tau\big((\theta^1 a^1 z^1, \theta^2 a^2 z^2), s'\big) = \beta^1_{\tau-1}(a^1 \mid \theta^1)\,\beta^2_{\tau-1}(a^2 \mid \theta^2) \sum_{s} \Pr\big(s', z^1, z^2 \mid s, a^1, a^2\big)\,\sigma_{\tau-1}\big((\theta^1, \theta^2), s\big).$$
Proof.
The proof goes by simply developing the definition:
$$\begin{aligned}
\sigma_\tau\big((\theta^1 a^1 z^1, \theta^2 a^2 z^2), s'\big)
&= \Pr\big(\theta^1 a^1 z^1, \theta^2 a^2 z^2, s' \mid \sigma_{\tau-1}, \beta^1_{\tau-1}, \beta^2_{\tau-1}\big) \\
&= \sum_{s} \Pr\big(s', z^1, z^2 \mid s, a^1, a^2\big)\,\beta^1_{\tau-1}(a^1 \mid \theta^1)\,\beta^2_{\tau-1}(a^2 \mid \theta^2)\,\sigma_{\tau-1}\big((\theta^1, \theta^2), s\big) \\
&= \beta^1_{\tau-1}(a^1 \mid \theta^1)\,\beta^2_{\tau-1}(a^2 \mid \theta^2) \sum_{s} \Pr\big(s', z^1, z^2 \mid s, a^1, a^2\big)\,\sigma_{\tau-1}\big((\theta^1, \theta^2), s\big). \qquad \blacksquare
\end{aligned}$$
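The update in Lemma 1 can be sketched directly in code. The following minimal Python version uses hypothetical dict-based data structures (histories represented as tuples of (action, observation) pairs, the dynamics as a flat probability table); it computes the next occupancy state from $\sigma_{\tau-1}$, the decision rules, and the dynamics only, illustrating the Markov property:

```python
from collections import defaultdict

def next_occupancy(sigma, beta1, beta2, trans):
    """One-step occupancy update (sketch of Lemma 1, hypothetical
    data structures):
      sigma: maps ((theta1, theta2), s) -> probability;
      beta_i: maps (theta_i, a_i) -> action probability;
      trans: maps (s, a1, a2, z1, z2, s_next) -> dynamics probability.
    Returns the next occupancy state as a dict of the same shape,
    with each private history extended by its (action, observation) pair."""
    nxt = defaultdict(float)
    for ((th1, th2), s), p in sigma.items():
        for (t1, a1), pa1 in beta1.items():
            if t1 != th1:
                continue
            for (t2, a2), pa2 in beta2.items():
                if t2 != th2:
                    continue
                for (s0, b1, b2, z1, z2, s1), pt in trans.items():
                    if (s0, b1, b2) != (s, a1, a2):
                        continue
                    key = ((th1 + ((a1, z1),), th2 + ((a2, z2),)), s1)
                    nxt[key] += p * pa1 * pa2 * pt
    return dict(nxt)
```

The nested loops mirror the sum in the lemma; a practical implementation would index the tables to avoid scanning them, but the weights multiplied together are exactly the terms of the formula.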