# On Bellman's Optimality Principle for zs-POSGs

Many non-trivial sequential decision-making problems are efficiently solved by relying on Bellman's optimality principle, i.e., exploiting the fact that sub-problems are nested recursively within the original problem. Here we show how it can apply to (infinite horizon) 2-player zero-sum partially observable stochastic games (zs-POSGs) by (i) taking a central planner's viewpoint, which can only reason on a sufficient statistic called occupancy state, and (ii) turning such problems into zero-sum occupancy Markov games (zs-OMGs). Then, exploiting the Lipschitz-continuity of the value function in occupancy space, one can derive a version of the HSVI algorithm (Heuristic Search Value Iteration) that provably finds an ϵ-Nash equilibrium in finite time.

10/25/2021


## 1 Introduction

Bellman’s optimality principle (Bellman’s OP) [3] led to state-of-the-art solvers in many non-trivial sequential decision-making problems, assuming partial observability [25], multi-objective criteria [29, 21], collaborating agents, e.g., modeled as decentralized partially observable Markov decision processes (Dec-POMDPs) [13, 30, 9], or some non-collaborative perfect-information games (from Shapley’s seminal work [26] to [6]). In all these settings, this principle exploits the fact that sub-problems are nested recursively within the original problem. An open question is whether, and how, it could be applied to imperfect-information games, which are encountered in diverse applications such as Poker [18] or security games [1]. This paper answers this question in the setting of 2-player zero-sum partially observable stochastic games (zs-POSGs), i.e., imperfect-information games with simultaneous moves, perfect recall, discounted rewards and a possibly infinite time horizon.

Like general POSGs and Dec-POMDPs, infinite-horizon zs-POSGs are undecidable, and their finite-horizon approximations are in NEXP [22, 4]. As further discussed in Section 2, solution techniques for finite-horizon POSGs, or other imperfect-information games that can be formulated as extensive-form games (EFGs), typically solve an equivalent normal-form game [27] or use a dedicated regret-minimization mechanism [34, 5]. They thus do not rely on Bellman’s optimality principle, except (i) a dynamic programming approach that only constructs sets of non-dominated solutions [13], (ii) in collaborative problems (Decentralized POMDPs), adopting the viewpoint of a (blind) central planner [30, 9], and (iii) for (mostly 2-player zero-sum) settings with observability assumptions such that one can reason on player beliefs [12, 7, 2, 15, 8, 14]. Here, we do not make any assumption beyond the game being 2-player zero-sum, in particular regarding observability of the state and actions.

As for a number of Dec-POMDP solvers, our approach adopts the viewpoint not of a player, but of a central (offline) planner that prescribes individual strategies to the players [30], which allows turning a zs-POSG into a non-observable game for which Bellman’s optimality principle applies. This is achieved in Section 4 (after background Section 3) while reasoning not on a player’s belief over the game state (as feasible in POMDPs or some particular games), but on the central planner’s (blind) belief, a statistic called the occupancy state, which we prove to be sufficient for optimal planning, as Dibangoye et al. did for Dec-POMDPs [9]. In Section 5, our Bellman/Shapley operator is proved to induce an optimal game value function that is Lipschitz-continuous in occupancy space, which leads to deriving value function approximators, including upper- and lower-bounding ones, and discussing their initialization. Finally, Section 6 describes a variant of HSVI for zs-POSGs, and demonstrates its finite-time convergence to an $\epsilon$-optimal solution despite the continuous (occupancy) state and action spaces.

## 2 Related Work

Infinite-horizon POSGs are undecidable [22], which justifies searching for near-optimal solutions, e.g., through finite-horizon solutions, as we will do. There is little work on solving POSGs, in particular through exploiting Bellman’s optimality principle. One exception is Hansen and Zilberstein’s work on finite-horizon POSGs [13], where dynamic programming (DP) incrementally constructs non-dominated policy trees for each player, which then allows deriving a solver for common-payoff POSGs, i.e., decentralized partially observable Markov decision processes (Dec-POMDPs). Here, Bellman’s OP thus serves as a pre-processing phase, while we aim at employing it in the core of algorithms.

##### Dec-POMDPs

Bellman’s OP appears as the core component of a Dec-POMDP solver when Szer et al. [30] adopt a planner-centric viewpoint whereby the planner aims at providing the players with their private policies without knowing which action-observation histories they have experienced. The planner’s information state at time $t$ thus contains the initial belief and the joint policy up to $t$. This leads to turning a Dec-POMDP into an information-state MDP, and obtaining a deterministic shortest-path problem that can be solved using an A* search called MAA* (multi-agent A*).

Then, another important step is when Dibangoye et al. [9] show that (i) the occupancy state, a statistic used to compute expected rewards in MAA*, is in fact sufficient for planning, and (ii) the optimal value function $V^*$ is piecewise linear and convex (PWLC) in occupancy space, which allows adapting point-based POMDP solvers using approximators of $V^*$.

##### Subclasses of POSGs

Recent works addressed particular cases of discounted partially observable stochastic games (POSGs), 2-player and zero-sum if not specified otherwise, exploiting the structure of the problem to turn it into an equivalent problem for which Bellman’s principle applies. Ghosh et al. [12] considered POSGs with public actions and shared observations, which can be turned into stochastic games defined over the common belief space, similarly to POMDPs turned into belief MDPs. Chatterjee and Doyen [7], Basu and Stettner [2], and Horák et al. [15] considered One-Sided POSGs, i.e., scenarios where player 1 (w.l.o.g.) only partially observes the system state, while player 2 has access to the system state, plus the action and observation of player 1. Cole and Kocherlakota [8] considered ($N$-player) POSGs with independent private states, partially shared observability, and each player’s utility function depending on his private state and on the shared observation. Horák and Bošanský [14] considered zs-POSGs with independent private states and public observations, i.e., scenarios where (i) each player has a private state he fully observes, and (ii) both players receive the same public observations of each player’s private state. Any player’s belief over the other player’s private state is thus common knowledge.

Focusing on the work of Horák et al. [15, 14], in both cases convexity or concavity properties of the optimal value function are obtained, which allow deriving upper- and lower-bounding approximators. These approximators are then employed in HSVI-based algorithms. Yet, moving from MDPs and POMDPs (as in Smith’s work) to these settings induces a tree of possible futures with an infinite branching factor, which requires changes to the algorithm, and thus to the theoretical analysis of the finite-time convergence. As we shall see, the present work adopts similar changes.

Wiggers et al. [32] prove that, using appropriate representations, the value function associated to a zs-POSG is convex for (maximizing) player 1 and concave for (minimizing) player 2. Yet, this did not allow deriving a solver based on approximating the value function. Here, we exploit no convexity or concavity property of the optimal value function, as they may not hold, but its Lipschitz continuity.

##### Imperfect Information Games

Finite-horizon (general-sum) POSGs can be written as extensive-form games with imperfect information and perfect recall (EFGs, often referred to as imperfect-information games) [24], which makes solution techniques for EFGs relevant even for infinite-horizon POSGs. A first approach to solving EFGs is to turn them into a normal-form game before looking for a Nash equilibrium, thus ignoring the temporal aspect of the problem [27] and inducing a combinatorial explosion. For (2-player) zs-EFGs, this leads to solving two linear programs (one for each player). Koller and Megiddo [16] propose a different linear-programming approach for zs-EFGs that exploits the temporal aspect through the choice of decision variables, but still does not apply Bellman’s OP (see also [31, 17]).

More recently, Counterfactual Regret minimization (CFR) [34] has been introduced, which solves large imperfect-information games with bounded regret, such as heads-up no-limit hold’em poker, now winning against top human players [5]. While some CFR-based algorithms use heuristic-search techniques, and thus somehow exploit the sequentiality of the game, they do not rely on Bellman’s OP either.

## 3 Background

For the sake of clarity, the concepts and results of the EFG literature used in this work will be recast in the POSG setting. We shall employ the terminology of pure/mixed/behavioral strategies and strategy profiles—more convenient in our non-collaborative setting—instead of deterministic or stochastic policies (private or joint ones)—common in the collaborative setting of Dec-POMDPs.

A (2-player) zero-sum partially observable stochastic game (zs-POSG) is defined by a tuple $\langle \mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, \mathcal{Z}^1, \mathcal{Z}^2, P, r, H, \gamma, b_0 \rangle$, where

• $\mathcal{S}$ is a finite set of states;

• $\mathcal{A}^i$ is (player) $i$’s finite set of actions;

• $\mathcal{Z}^i$ is $i$’s finite set of observations;

• $P^{z^1,z^2}_{a^1,a^2}(s'|s)$ is the probability to transition to state $s'$ and receive observations $z^1$ and $z^2$ when actions $a^1$ and $a^2$ are performed in state $s$;

• $r(s,a^1,a^2)$ is a (scalar) reward function;

• $H \in \mathbb{N} \cup \{\infty\}$ is a temporal horizon;

• $\gamma \in [0,1)$ is a discount factor; and

• $b_0$ is the (public/common) initial belief state.

Player 1 would like to maximize the expected return, defined as the discounted sum of future rewards, while player 2 would like to minimize it, as we formalize next.

From the Dec-POMDP, POSG and EFG literature, we use the following concepts and definitions, where $i \in \{1,2\}$:

$\neg i$ is $i$’s opponent.

$\theta^i_\tau = (a^i_0, z^i_1, \ldots, a^i_{\tau-1}, z^i_\tau)$ is a length-$\tau$ action-observation history for $i$. The set of such histories is $\Theta^i$, with one subset $\Theta^i_\tau$ per time step.

$\theta_\tau = (\theta^1_\tau, \theta^2_\tau)$ is a joint history at time $\tau$. The set of joint histories is $\Theta$, with one subset $\Theta_\tau$ per time step.

An occupancy state $o_\tau$ at time $\tau$ is a probability distribution over state–joint-history pairs $(s, \theta_\tau)$. ($o_\tau$ is completely specified by $b_0$ and the strategies applied up to $\tau$.) The set of occupancy states is $\mathcal{O}$, with one subset $\mathcal{O}_\tau$ per time step. Note that this notion applies to POSGs despite the use of stochastic actions.

A pure strategy $\pi^i_{\tau:\tau'}$ for $i$ is a mapping from private histories in $\Theta^i_\tau, \ldots, \Theta^i_{\tau'}$ to single private actions. By default, $\pi^i$ denotes a complete pure strategy $\pi^i_{0:H-1}$.

$\pi = (\pi^1, \pi^2)$ is a pure strategy profile.

A mixed strategy $\mu^i$ for $i$ is a probability distribution over complete pure strategies. It is used by first sampling one of the pure strategies (at $\tau = 0$), and then executing that strategy until $\tau = H-1$.

$\mu = (\mu^1, \mu^2)$ is a mixed strategy profile.

A (behavioral) decision rule $\beta^i_\tau$ at time $\tau$ for $i$ is a mapping from private histories in $\Theta^i_\tau$ to distributions over private actions. For convenience, we will note $\beta^i_\tau(\theta^i_\tau, a^i)$ the probability to pick action $a^i$ when facing history $\theta^i_\tau$.

$\beta_\tau = (\beta^1_\tau, \beta^2_\tau)$ is a decision rule profile (noting $\beta_\tau(\theta_\tau, a) = \beta^1_\tau(\theta^1_\tau, a^1) \cdot \beta^2_\tau(\theta^2_\tau, a^2)$).

$\beta^i_{\tau:\tau'} = (\beta^i_\tau, \ldots, \beta^i_{\tau'})$ is a behavioral strategy for $i$ from time step $\tau$ to $\tau'$ (included). By default, $\beta^i$ denotes $\beta^i_{0:H-1}$.

$\beta = (\beta^1, \beta^2)$ is a behavioral strategy profile.

The value of a behavioral strategy profile $\beta$ in occupancy state $o_0$ (from time step 0 on) is:

$$V_0(o_0, \beta) = E\!\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, O_0 = o_0, \beta\right],$$

where $R_t$ is the random variable associated to the instant reward at time step $t$. [Note: This definition extends naturally to pure and mixed strategy profiles.]

The primary objective here is to find a Nash equilibrium strategy (NES), i.e., a mixed strategy profile $(\mu^1_*, \mu^2_*)$ such that no player has an incentive to deviate, which can be written:

$$\forall \mu^1,\; V_0(o_0, \mu^1_*, \mu^2_*) \geq V_0(o_0, \mu^1, \mu^2_*), \qquad \forall \mu^2,\; V_0(o_0, \mu^1_*, \mu^2_*) \leq V_0(o_0, \mu^1_*, \mu^2).$$

In such a 2-player zero-sum game, all NESs have the same Nash-equilibrium value (NEV) $V^*_0(o_0)$.
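To illustrate the NEV on a toy case (not taken from the paper), the following sketch approximates the maximin and minimax values of a zero-sum matrix game by a grid search over mixed strategies; for matching pennies both values coincide on the unique NEV, 0, at $p = q = 1/2$.

```python
# Toy illustration: in a 2-player zero-sum matrix game, maximin and minimax
# coincide on the unique Nash-equilibrium value (NEV). Player 1 mixes over
# rows with probability p; player 2 over columns with probability q.

def maximin(payoff, steps=1000):
    """max over p of (min over pure columns) of the expected payoff."""
    best = float("-inf")
    for i in range(steps + 1):
        p = i / steps
        worst = min(p * payoff[0][j] + (1 - p) * payoff[1][j] for j in range(2))
        best = max(best, worst)
    return best

def minimax(payoff, steps=1000):
    """min over q of (max over pure rows) of the expected payoff."""
    best = float("inf")
    for i in range(steps + 1):
        q = i / steps
        response = max(q * payoff[k][0] + (1 - q) * payoff[k][1] for k in range(2))
        best = min(best, response)
    return best

pennies = [[1, -1], [-1, 1]]   # matching pennies: NEV is 0 at p = q = 1/2
print(maximin(pennies), minimax(pennies))  # both values are 0
```

Here the grid contains the optimal mixture exactly, so both values come out as 0; in general the grid only approximates them.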

Finite-horizon POSGs being equivalent to EFGs with imperfect information and perfect recall, the following key result for EFGs applies to (finite-$H$) POSGs:

###### Theorem 1.

[19, 11] In a game of perfect recall, mixed and behavioral strategies are equivalent. (More precisely: Every mixed strategy is equivalent to the unique behavioral strategy it generates, and each behavioral strategy is equivalent to every mixed strategy that generates it.)
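The "mixed to behavioral" direction of this equivalence can be sketched concretely: the behavioral probability at a history is the ratio of realization weights of the pure strategies consistent with that history. The encoding below is an assumption for illustration only (a horizon-2 game where a pure strategy is a pair `(first_action, {observation: second_action})`), not the paper's formalism.

```python
# Hedged sketch of Kuhn's theorem, direction "mixed -> equivalent behavioral".
# A pure strategy here is (first_action, {observation: second_action}).

def behavioral(mixed, history):
    """mixed: list of (prob, pure_strategy). history: () or (a0, z)."""
    if history == ():                      # root: marginal over first actions
        consistent = mixed
    else:                                  # keep strategies that reach history
        a0, _ = history
        consistent = [(p, pi) for p, pi in mixed if pi[0] == a0]
    total = sum(p for p, _ in consistent)  # realization weight of the history
    dist = {}
    for p, pi in consistent:
        a = pi[0] if history == () else pi[1][history[1]]
        dist[a] = dist.get(a, 0.0) + p / total
    return dist

# Mix two pure strategies that share the same first action.
mu = [(0.25, (0, {0: 0, 1: 1})), (0.75, (0, {0: 1, 1: 1}))]
print(behavioral(mu, ()))        # {0: 1.0}
print(behavioral(mu, (0, 0)))    # {0: 0.25, 1: 0.75}
```

The resulting behavioral strategy generates the same distribution over outcomes as the mixed one, which is exactly the equivalence the theorem states.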

## 4 Solving POSGs as Occupancy MGs

In this section, unless stated otherwise, we assume finite horizons and exact solutions (no error).

Here, we show (i) how a zs-POSG can be reformulated as a different zero-sum Markov game, and (ii) that Bellman’s optimality principle applies in this game.

### 4.1 From zs-POSGs to zs-OMGs

To solve a zs-POSG, we take the viewpoint of a central planner that searches offline for the best behavioral strategy profile before providing it to the players. This contrasts with Dec-POMDPs, where deterministic strategy profiles suffice, and means exploring a (bounded) continuous space rather than a (finite) discrete one as for Dec-POMDPs. Such a planner grows a partial strategy profile $\beta_{0:\tau-1}$ by appending a decision rule profile $\beta_\tau$.

Note that any partial strategy profile $\beta_{0:\tau-1}$ is in one-to-one correspondence with an occupancy state $o_\tau$. So, the controlled process induced in occupancy space, where actions are decision rule profiles, is both deterministic and Markovian (see formal details about the dynamics below): applying $\beta_\tau$ in $o_\tau$ (i.e., appending it to $\beta_{0:\tau-1}$) leads to a unique $o_{\tau+1}$. Also, the expected reward at time $\tau$ is linear in occupancy space (more precisely in the corresponding distribution over states). All this allows reasoning not on partial behavioral strategy profiles, but on occupancy states. The central planner will thus (i) infer occupancy states seen as “beliefs” over the possible situations (“situation” here meaning the current state $s$ and the players’ joint action-observation history $\theta_\tau$) which may have been reached, although without knowing what actually happened, and (ii) map each occupancy state to a decision rule profile telling the players how to act depending on their actual action-observation histories.¹ Each zs-POSG is thus turned into an equivalent game, called a zero-sum occupancy Markov game (zs-OMG),² formally defined by the tuple $\langle \mathcal{O}, \mathcal{B}, T, r, H, \gamma, o_0 \rangle$, where:

¹ In contrast, in a POMDP, the belief state depends on the agent’s action-observation history, and is mapped to a single action.
² We use (i) “Markov game” instead of “stochastic game” because the dynamics are not stochastic, and (ii) “partially observable stochastic game” to stick with the literature.

• $\mathcal{O}$ is the set of occupancy states induced by the zs-POSG;

• $\mathcal{B}$ is the set of decision rule profiles of the zs-POSG;

• $T$ is a deterministic transition function that maps each pair $(o_\tau, \beta_\tau)$ to the (only) possible next occupancy state $o_{\tau+1}$; formally (see Lemma 1 in App. A.1),

$$T(o_\tau, \beta_\tau)\big(s', (\theta^1_\tau, a^1, z^1), (\theta^2_\tau, a^2, z^2)\big) \stackrel{\text{def}}{=} \Pr\big(s', (\theta^1_\tau, a^1, z^1), (\theta^2_\tau, a^2, z^2)\big) = \beta^1_\tau(\theta^1_\tau, a^1)\, \beta^2_\tau(\theta^2_\tau, a^2) \sum_s P^{z^1,z^2}_{a^1,a^2}(s'|s)\, o_\tau(s, \theta^1_\tau, \theta^2_\tau);$$
• $r$ is a reward function naturally induced from the zs-POSG as the expected reward for the current occupancy state and decision rule profile:

$$r(o_\tau, \beta_\tau) \stackrel{\text{def}}{=} E\big[r(S, A^1, A^2) \mid o_\tau, \beta^1_\tau, \beta^2_\tau\big] = \sum_{s, \theta_\tau} o_\tau(s, \theta^1_\tau, \theta^2_\tau) \sum_{a^1, a^2} \beta^1_\tau(\theta^1_\tau, a^1)\, \beta^2_\tau(\theta^2_\tau, a^2)\, r(s, a^1, a^2);$$

we use the same notation $r$ as for the zs-POSG, as the context shall indicate which one is discussed;

• $H$ and $\gamma$ are as in the zs-POSG, and $o_0$ is the initial occupancy state induced by $b_0$.

Note first that, for convenience, we directly consider behavioral decision rules, which correspond to mixed strategies. Of course, at time $\tau$, $i$’s possible actions should be decision rules defined over histories that have non-zero probability in the current $o_\tau$. The dynamics being deterministic and the actions public, both players of that new game (also denoted 1 and 2, although these are different players) know the next state after each transition. But this is not a standard zs Markov game either, since (i) the mixture of two actions is equivalent to another action already in the (continuous) action space at hand, and (ii) at each time step, the state (occupancy) space is continuous.
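The two formulas above can be sketched directly in code. This is a minimal toy model (the state space, dynamics, and reward below are illustrative assumptions, not the paper's benchmarks): an occupancy state is a dictionary over (state, history¹, history²) triples, and applying a decision-rule profile yields the next occupancy state and the expected reward.

```python
# Minimal sketch of the zs-OMG dynamics: occupancy states are distributions
# over (state, joint-history) triples, actions are decision-rule profiles.

def T(o, b1, b2, P):
    """Deterministic occupancy transition o_{tau+1} = T(o_tau, beta_tau)."""
    nxt = {}
    for (s, h1, h2), p in o.items():
        for a1, pa1 in b1(h1).items():
            for a2, pa2 in b2(h2).items():
                for (s2, z1, z2), pt in P(s, a1, a2).items():
                    key = (s2, h1 + ((a1, z1),), h2 + ((a2, z2),))
                    nxt[key] = nxt.get(key, 0.0) + p * pa1 * pa2 * pt
    return nxt

def r(o, b1, b2, R):
    """Expected instant reward of decision-rule profile (b1, b2) in o."""
    return sum(p * pa1 * pa2 * R(s, a1, a2)
               for (s, h1, h2), p in o.items()
               for a1, pa1 in b1(h1).items()
               for a2, pa2 in b2(h2).items())

# Toy zero-sum game: 2 states, 2 actions each, a single dummy observation.
P = lambda s, a1, a2: {((s + a1) % 2, 0, 0): 1.0}    # deterministic dynamics
R = lambda s, a1, a2: (1 if s == 0 else -1) * (1 if a1 == a2 else -1)
uniform = lambda h: {0: 0.5, 1: 0.5}                 # uniform decision rules

o0 = {(0, (), ()): 1.0}                              # initial occupancy
o1 = T(o0, uniform, uniform, P)
assert abs(sum(o1.values()) - 1.0) < 1e-12           # still a distribution
```

Note that `T` is deterministic as a map between occupancy states even though the underlying game dynamics and decision rules are stochastic, which is the key point of the reformulation.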

We shall study the subgames of a zs-OMG, i.e., situations where some occupancy state $o_\tau$ has somehow been reached at time step $\tau$, and the central solver is looking for rational strategies ($\beta^1_{\tau:H-1}$ and $\beta^2_{\tau:H-1}$) to provide to the players. $o_\tau$ tells which action-observation histories each player could be facing with non-zero probability, and thus which ones are relevant for planning. We can then extend the definition of the value function from time step 0 only to any time step $\tau$ as follows (using behavioral strategies):

$$V_\tau(o_\tau, \beta^1_{\tau:H-1}, \beta^2_{\tau:H-1}) = E\!\left[\sum_{t=\tau}^{\infty} \gamma^{t-\tau} R_t \,\Big|\, O_\tau = o_\tau, \beta^1_{\tau:H-1}, \beta^2_{\tau:H-1}\right].$$

Note that $o_\tau$ is in one-to-one correspondence with a strategy profile $\beta_{0:\tau-1}$, so that we can consider its concatenation with any suffix profile $\beta_{\tau:H-1}$.

### 4.2 Back to Mixed Strategies

We now re-introduce, and generalize, mixed strategies as a mathematical tool to handle subgames of a zs-OMG as normal-form games, and give some preliminary results.

For a given $o_\tau$, let $\mu_{0:\tau-1|o_\tau\rangle}$ be an arbitrarily chosen mixed strategy profile that leads to (/is compatible with) $o_\tau$, thus defined over time interval $0:\tau-1$. To complete this mixed (prefix) strategy, the central planner should provide each player $i$ with a different (suffix) strategy to execute for each $\theta^i_\tau$ it could be facing. We now detail how to build an equivalent set of mixed (full) strategies for $i$. Each of the pure (prefix) strategies used in $\mu^i_{0:\tau-1|o_\tau\rangle}$ (belonging to a set denoted $\Pi^i_{0:\tau-1}$) can be extended by appending a different pure (suffix) strategy at each of its leaf nodes, which leads to a large set of pure strategies $\Pi^i_{0:H-1}$. Then, let $M^i_{|o_\tau\rangle}$ be the set of mixed (full) strategies obtained by considering the distributions over $\Pi^i_{0:H-1}$ that verify, $\forall \pi^i_{0:\tau-1} \in \Pi^i_{0:\tau-1}$,

$$\sum_{\pi^i_{0:H-1} \in \Pi^i_{0:H-1}(\pi^i_{0:\tau-1})} \mu^i_{0:H-1|o_\tau\rangle}(\pi^i_{0:H-1}) = \mu^i_{0:\tau-1|o_\tau\rangle}(\pi^i_{0:\tau-1}). \qquad (1)$$

This is the set of mixed strategies compatible with $o_\tau$.

**Lemma** (proof in App. A.2). $M^i_{|o_\tau\rangle}$ is convex and equivalent to the set of behavioral strategies compatible with $o_\tau$, thus sufficient to search for a Nash equilibrium in the subgame at $o_\tau$.

While only future rewards are relevant when making a decision at $\tau$, reasoning with mixed strategies defined from $t = 0$ will be convenient because $V_0$ is linear in each player’s mixed strategy, which allows coming back to a standard normal-form game and applying known results.

In the remainder, we simply note $\mu^i$ (without index) the mixed strategies in $M^i_{|o_\tau\rangle}$, a set which we now note $M^i$. Also, since we shall work with the local game at $o_\tau$, let us define the sets of $i$’s mixed strategies compatible with the occupancy states reachable from $o_\tau$ given a decision rule of one player or the other. These sets being less constrained in their definition, they include $M^i$. As a consequence, maximizing some function over $i$’s mixed strategies compatible with a given occupancy state can equivalently be performed over these larger sets.

As can be easily demonstrated (cf. Lemma 2 in App. A.3), any Nash equilibrium solution of our original game induces a Nash equilibrium in any of its reachable subgames.³ But this does not tell whether Bellman’s optimality principle applies, which we discuss next.

³ In contrast, a subgame-perfect equilibrium requires a Nash equilibrium in any subgame reachable by some strategy profile, which is more constraining.

### 4.3 Bellman’s Optimality Principle

For any $\tau$ and $o_\tau$, let us define (i) $\beta^*_{\tau:H-1}$ a NE profile for the subgame at $o_\tau$, (ii) $V^*_\tau(o_\tau)$ the NE value of the subgame at any $o_\tau$, and (iii) the local subgame at $o_\tau$:

$$Q^*_\tau(o_\tau, \beta_\tau) \stackrel{\text{def}}{=} r(o_\tau, \beta_\tau) + \gamma\, V^*_{\tau+1}\big(T(o_\tau, \beta_\tau)\big).$$

Then, given Nash equilibrium solutions for any $o_{\tau+1}$, the applicability of Bellman’s optimality principle shall be proved if a Nash equilibrium of the subgame at $o_\tau$ can be found by (i) solving the local subgame $Q^*_\tau(o_\tau, \cdot)$ to get a decision rule profile $\beta_\tau$ and (ii) appending to it the Nash equilibria of the reached subgames.

##### An Abnormal-Form Game?

A first question is whether this game is in fact a normal-form game, i.e., whether it could be defined by a payoff matrix over pure decision rules, payoffs for behavioral decision rules being obtained through linear mixtures.

The value function is linear in each player’s decision rule space at each time step (i.e., in $\beta^i_t$ for any $i$ and $t$), but multilinear in each player’s behavioral strategy space (see Lemma 1 in App. A.4.1), which suggests that $Q^*_\tau$ may not be convex-concave (and thus not (bi)linear) in the space of decision rules at $\tau$. As a consequence, we are possibly facing an abnormal-form game and cannot use von Neumann’s Minimax theorem.

##### Properties of the Maximin and Minimax Values

Rather than digging into the convexity-concavity property further, we now show that computing the maximin and minimax values of $Q^*_\tau(o_\tau, \cdot)$ induces finding a NE of the subgame at $o_\tau$, given NEs for any $o_{\tau+1}$.

**Theorem** (proof in App. A.4.2). In the 2p zs abnormal-form game $Q^*_\tau(o_\tau, \cdot)$, the maximin and minimax values are both equal to $V^*_\tau(o_\tau)$, i.e., as previously defined, the NEV for this game, and correspond to a NES.

###### Proof.

(sketch) The proof relies on first developing the maximin of $Q^*_\tau(o_\tau, \cdot)$, then using (i) the equivalence of maximin and minimax for mixed strategies (when von Neumann’s minimax theorem applies), and (ii) the equivalence of mixed and behavioral strategies. ∎

**Corollary** (proof in App. A.4.2). As in 2p zs normal-form games, game $Q^*_\tau(o_\tau, \cdot)$ has at least one NES; all its NESs are value-equivalent; and solving for maximin and minimax values allows finding one NES.

##### Maximin and Minimax Computation

The last results tell us that we can exploit knowledge of the optimal value function at $\tau+1$ (for all $o_{\tau+1}$) to find optimal decision rules at $\tau$ for any given $o_\tau$ by computing the maximin and minimax values of the local (abnormal-form) game at hand. Yet, we cannot use an LP as for normal-form games. To find an appropriate solution method, let us now look at properties of this game, noting that we lack any convexity/concavity property, and presenting a preliminary result.

**Lemma** (proof in App. A.4.3). At depth $\tau$, $T(o_\tau, \beta_\tau)$ is linear in $o_\tau$, $\beta^1_\tau$, and $\beta^2_\tau$. It is more precisely 1-Lipschitz-continuous in $o_\tau$ (in 1-norm), i.e., for any $o_\tau$, $o'_\tau$:

$$\|T(o'_\tau, \beta_\tau) - T(o_\tau, \beta_\tau)\|_1 \leq 1 \cdot \|o'_\tau - o_\tau\|_1.$$

The Lipschitz-continuity (LC) property would also hold in 2-norm or $\infty$-norm, due to the equivalence between norms, but with different constants.

**Lemma** (proof in App. A.4.3). For any $\tau$ and $o_\tau$, $Q^*_\tau(o_\tau, \beta_\tau)$ is Lipschitz-continuous in both $\beta^1_\tau$ and $\beta^2_\tau$.

The payoff function of our game is thus LC in each private decision-rule space, which suggests using error-bounded global optimization techniques, as Munos’s DOO (Deterministic Optimistic Optimization) [23]. Here, searching for a maximin (resp. minimax) value suggests using two nested optimization processes: an “outer” one for the $\max$ (resp. $\min$) operator, and an “inner” one for the $\min$ (resp. $\max$). To ensure being within $\epsilon$ of the maximin value, each process could, for example, use an $\epsilon/2$ tolerance threshold. Yet, in such a nested optimization process, the inner process may stop, at each call, before reaching its tolerance if it leads the outer process to explore a different point.
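The nested scheme can be sketched with a crude stand-in for DOO, a uniform grid: for a payoff that is $\lambda$-Lipschitz in each variable, a grid of step $d$ yields a maximin estimate within $\lambda d$ of the true value. The payoff `f` below is an illustrative toy function, not one of the paper's games.

```python
# Hedged sketch of error-bounded maximin via nested grid search on a
# Lipschitz payoff (a crude stand-in for DOO-style optimistic optimization).

def nested_maximin(f, n=200):
    """Outer max over x, inner min over y, both on a uniform [0,1] grid."""
    grid = [i / n for i in range(n + 1)]
    return max(min(f(x, y) for y in grid) for x in grid)

# Toy zero-sum payoff on [0,1]^2, 1-Lipschitz in each variable; its exact
# maximin value is 0, reached at x = y = 1/2.
f = lambda x, y: -(x - 0.5) ** 2 + (y - 0.5) ** 2
v = nested_maximin(f)
assert abs(v) <= 2 * (1.0 / 200)   # within the Lipschitz error bound
```

A DOO-style method would refine the grid adaptively where the optimistic Lipschitz bound is highest instead of uniformly, but the error guarantee has the same flavor.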

Due to the continuous state space of zs-OMGs, $V^*$ cannot be computed exactly. We shall now see how to approximate it, before exploiting the resulting approximators in a specific version of HSVI in Sec. 6.

## 5 Properties of V∗

In this section, we again assume finite-horizon problems (unless stated otherwise). The main objective here is to propose upper- and lower-bounding approximators that exploit $V^*$’s Lipschitz-continuity (rather than PWLC) property, as Fehr et al. [10] did in the setting of (single-agent) information-oriented control, but here with simpler derivations.

### 5.1 Finite-Horizon Lipschitz Continuity of V∗

The following lemma proves that the expected instant reward at any $\tau$ is linear in $o_\tau$, and thus so is the expected value of a finite-horizon strategy profile from $\tau$ onwards (trivial proof by induction).

**Lemma** (proof in App. A.5.1). At depth $\tau$, $V_\tau(o_\tau, \beta_{\tau:H-1})$ is linear w.r.t. $o_\tau$.

**Corollary** (proof in App. A.5.1). $V^*_\tau$ is Lipschitz-continuous in $o_\tau$ at any depth $\tau$.

##### Refining the Lipschitz constant(s)

We have just discussed the LC of $V^*$ based on the LC of finite-horizon strategies, reasoning on worst-case Lipschitz constants (one per time step) that hold for all strategies. Now, (i) could we refine those constants based on knowledge regarding $V^*$, in particular upper and lower bounds $U$ and $L$ (see next sections)? And (ii) could we make use of those refined constants in the planning process?

Regarding question (i), $U$ and $L$ tell us that any strategy profile from time $\tau$ on (and thus with remaining horizon $H - \tau$) has values within $L^{\min}_\tau$ and $U^{\max}_\tau$, hence the refined Lipschitz constant:

$$\lambda^{LU}_\tau = \frac{U^{\max}_\tau - L^{\min}_\tau}{2}.$$

Regarding question (ii), as $U$ and $L$ are refined during the planning process, these refined depth-dependent constants would progressively shrink, thus speeding up planning! This phenomenon could encourage improving the value function bounds where they seem high (for $U$) or low (for $L$).

### 5.2 Approximating V∗

Note: For the sake of readability, the depth index may be omitted when it can be inferred from the occupancy state.

##### Approximators

An HSVI-like algorithm requires maintaining both an upper and a lower approximator of $V^*$. We denote them $U$ and $L$, and note $W = U - L$ their width.

The LC of $V^*_\tau$ suggests employing LC function approximators for $U$ at depth $\tau$ in the form of a lower envelope of (i) an initial upper bound $U^{(0)}$ and (ii) downward-pointing L1-cones, where an upper-bounding cone $\omega$, located at $o_\omega$ with “summit” value $v_\omega$ and slope $\lambda_\tau$, induces a function $U^{(\omega)}(o) = v_\omega + \lambda_\tau \|o - o_\omega\|_1$. The upper bound is thus defined as the lower envelope of $U^{(0)}$ and the set of cones $\Omega^U_\tau$, i.e.,

$$U(o) = \min\Big\{U^{(0)}(o),\; \min_{\omega \in \Omega^U_\tau} U^{(\omega)}(o)\Big\}.$$

Respectively, for the lower-bounding approximator at depth $\tau$: a lower-bounding (upward-pointing) cone $\omega$ induces a function $L^{(\omega)}(o) = v_\omega - \lambda_\tau \|o - o_\omega\|_1$; and the lower bound is defined as the upper envelope of an initial lower bound $L^{(0)}$ and the set of cones $\Omega^L_\tau$, i.e.,

$$L(o) = \max\Big\{L^{(0)}(o),\; \max_{\omega \in \Omega^L_\tau} L^{(\omega)}(o)\Big\}.$$
##### (Point-based) Operator and Value Updates

One cannot apply an update operator to improve a value function approximator uniformly over the occupancy space. Instead, when visiting some occupancy state $o$ (at depth $\tau$), we perform a point-based update of the upper bound by (i) finding the NEV of the following game (which relies on $U$ at $\tau+1$):

$$U(o, \beta_\tau) = \sum_{s, a^1, a^2} \Big(\sum_\theta o(s, \theta)\, \beta^1(\theta^1, a^1)\, \beta^2(\theta^2, a^2)\Big)\, r(s, a^1, a^2) + \gamma\, U\big(T(o, \beta_\tau)\big),$$

then (ii) adding a downward-pointing cone to $\Omega^U_\tau$. We note $U_{new}$ the upper bound after this update at point $o$. The same applies to $L$ with upward-pointing cones instead, and using notation $L_{new}$.
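A minimal sketch of this cone-envelope representation (an assumed data layout, not the paper's code): the upper bound is the lower envelope of an initial bound and a list of downward-pointing L1-cones, and a point-based update simply records a new cone. The symmetric lower bound would use upward-pointing cones and an upper envelope.

```python
# Lipschitz upper bound as the lower envelope of an initial bound U0 and
# downward-pointing L1-cones; a point-based update adds a new cone.

def l1(o1, o2):
    """L1 distance between two occupancy states stored as dicts."""
    keys = set(o1) | set(o2)
    return sum(abs(o1.get(k, 0.0) - o2.get(k, 0.0)) for k in keys)

class UpperBound:
    def __init__(self, u0, lam):
        self.u0, self.lam = u0, lam
        self.cones = []                       # list of (summit point, value)

    def value(self, o):
        cone_vals = [v + self.lam * l1(o, p) for p, v in self.cones]
        return min([self.u0(o)] + cone_vals)  # lower envelope

    def update(self, o, v):
        """Point-based update: record the new summit value at o."""
        self.cones.append((o, v))

U = UpperBound(u0=lambda o: 10.0, lam=2.0)
o_a = {("s0", (), ()): 1.0}
o_b = {("s1", (), ()): 1.0}   # L1 distance 2 from o_a
U.update(o_a, 3.0)
print(U.value(o_a))  # 3.0 (summit of the new cone)
print(U.value(o_b))  # min(10, 3 + 2*2) = 7.0
```

The slope `lam` is the (possibly refined) Lipschitz constant at the current depth, so each cone generalizes a point value to its whole L1-neighborhood.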

### 5.3 Initializations

Due to the symmetry between players in a zs-POSG, without loss of generality, let us look for an upper bound of the optimal value function $V^*$, i.e., an optimistic bound (an admissible heuristic) for (maximizing) player 1. A usual approach to obtaining optimistic bounds is to relax the problem for the player at hand. To that end, one can here envision manipulating the players’ knowledge, their control over the system, the action ordering, or the opponent’s objective, e.g.:

1. providing more (e.g. full) state observability to 1;

2. providing less (e.g. no) state observability to 2;

3. letting 1 know what 2 observes;

4. letting 1 control chance (2’s choice would then only restrict the set of reachable states), but this would require that 1 has full observability;

5. letting 2 act first, and telling 1 about 2’s selected action (exactly or through a partial observation);

6. turning 2 into a collaborator by making him maximize, rather than minimize, the expected return.

Accounting for related Markov models for sequential decision-making, this suggests turning the zs-POSG at hand for example into:

• a Dec-POMDP by turning the opponent into a collaborator (or even into a POMDP or an MDP); or

• a One-Sided POSG [15] by combining (i) full state observability, (ii) observability of 2’s observation, and (iii) observability of 2’s action.

Note that making both players’ actions or observations public (as in PO-POSGs [14]) would not be a viable solution as this would imply providing more knowledge to both players at the same time, which may prevent the resulting optimal value function from being an upper bound for our problem.

## 6 HSVI for zs-POSGs

In this section, we assume infinite-horizon problems (unless stated otherwise) and $\epsilon$-optimal solutions.

### 6.1 Algorithm

As we shall see, $\epsilon$-optimally solving an $\infty$-horizon zs-POSG amounts, as often, to solving a problem with finite horizon $T_{\max}$, which allows exploiting the results derived up to now. For convenience, we assume $T_{\max}$ already known and use horizon-dependent constants (e.g., Lipschitz constants).

HSVI for zs-OMGs is detailed in Algorithm 1. As vanilla HSVI, it relies on (i) generating trajectories while acting optimistically, i.e., player 1 (resp. 2) acting “greedily” w.r.t. $U$ (resp. $L$), and (ii) locally updating the upper- and lower-bounding approximators. Here, computations of value updates and strategies rely on solving our local zero-sum abnormal-form games (possibly a maximin/minimax optimization exploiting the Lipschitz continuity, as discussed in Sec. 4.3). A key difference lies in the criterion for stopping trajectories. In vanilla HSVI (for POMDPs), the finite branching factor allows looking at the convergence of $U$ and $L$ at each point reachable under an optimal strategy. To ensure $\epsilon$-convergence at $o_0$, trajectories just need to be interrupted when the current width at $o_\tau$ ($W(o_\tau)$, where $W = U - L$) is smaller than a threshold $thr(\tau)$. Here, dealing with an infinite branching factor, one may converge towards an optimal solution while always visiting new points of the occupancy space. Yet, as the sequence of generated (deterministic) trajectories converges to an optimal trajectory, the density of visited points around it increases, so that the Lipschitz approximation error tends to zero. One can thus bound the width within balls around visited points by exploiting the Lipschitz continuity of the optimal value function. As proposed by Horák et al. [15], this is achieved by adding a term to ensure that the width is below the desired level within a ball of radius $\rho$ around the current point (here the occupancy state $o_\tau$). Hence the threshold

$$thr(\tau) \stackrel{\text{def}}{=} \gamma^{-\tau}\epsilon - \sum_{i=1}^{\tau} 2\rho\,\lambda_{\tau-i}\,\gamma^{-i}. \qquad (2)$$
##### Setting ρ

As can be observed, this threshold function should always return positive values, which requires a small enough $\rho$. For a given problem, the maximum possible value of $\rho$ shall depend on the Lipschitz constants at each time step, which themselves depend on the upper and lower bounds of the optimal value function (and thus may evolve during the planning process). For the sake of simplicity, let us consider a single Lipschitz constant $\lambda$ common to all time steps, which always exists.

[Proof in App. A.6]lemmalemMaxRadius (originally stated on page 6.1) Assuming a single depth-independent Lipschitz constant , and noting that

 thr(τ) =γ−τϵ−2ρλγ−τ−11−γ, (3)

one can ensure positivity of the threshold at any by enforcing

We shall thus pick $\rho$ in $\left(0, \frac{(1-\gamma)\epsilon}{2\lambda}\right)$. But what is the effect of setting $\rho$ to small or large values?

• The smaller $\rho$, the larger $\mathrm{thr}(\tau)$, the shorter the trajectories, but the smaller the balls and the higher the required density of points around the optimal trajectory, thus the more trajectories needed to converge.

• The larger $\rho$, the smaller $\mathrm{thr}(\tau)$, the longer the trajectories, but the larger the balls and the lower the required density of points around the optimal trajectory, thus the fewer trajectories needed to converge.

So, setting $\rho$ means making a compromise between the number of generated trajectories and their length.

### 6.2 Finite-Time Convergence

First, the following result bounds the length of HSVI’s trajectories using the bounded width of $\overline{V} - \underline{V}$ and the exponential growth of $\mathrm{thr}(\tau)$.

###### Lemma (Proof in App. A.6).

Assuming a depth-independent Lipschitz constant $\lambda$, and with $\rho \in \left(0, \frac{(1-\gamma)\epsilon}{2\lambda}\right)$, the length of trajectories is upper-bounded by

$$T_{\max} \;\stackrel{\text{def}}{=}\; \left\lceil \log_{\gamma} \frac{\epsilon - \frac{2\rho\lambda}{1-\gamma}}{W - \frac{2\rho\lambda}{1-\gamma}} \right\rceil.$$

Note that (i) the classical upper bound is retrieved when $\rho = 0$ (Eq. (6.7) in [28]), and (ii) this gives us the maximum horizon needed to solve the problem. Now, knowing that any trial terminates in bounded time allows deriving the following results, in order.
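The bound and both observations can be checked numerically; in the sketch below (all names are ours) `W` stands for an upper bound on the initial width of the value approximators:

```python
import math

# Numerical sketch of the trial-length bound T_max.
def t_max(eps, gamma, rho, lam, W):
    c = 2.0 * rho * lam / (1.0 - gamma)
    # log in base gamma (0 < gamma < 1): log_gamma(x) = ln(x) / ln(gamma)
    return math.ceil(math.log((eps - c) / (W - c)) / math.log(gamma))
```

Setting `rho = 0` recovers the classical POMDP bound $\lceil \log_\gamma(\epsilon/W) \rceil$, and increasing `rho` (within its admissible range) lengthens the allowed trials, as discussed above.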

###### Theorem 1 (Proof in App. A.6).

Consider a trial of length $T$ and consider that the backward updates of $\overline{V}$ and $\underline{V}$ have not yet been performed. Then

1. the width at the last visited occupancy state satisfies $\overline{V}_T(o_T) - \underline{V}_T(o_T) \leq \mathrm{thr}(T)$, and

2. once the update at $o_{T-1}$ is performed, for every occupancy state $o'$ satisfying $\|o' - o_{T-1}\|_1 \leq \rho$, it holds: $\overline{V}_{T-1}(o') - \underline{V}_{T-1}(o') \leq \mathrm{thr}(T-1)$.

###### Theorem 2.

Algorithm 1 terminates with an $\epsilon$-approximation of $V^*(o_0)$.

###### Proof.

(Adapted from [14]) Assume, for the sake of contradiction, that the algorithm does not terminate and generates an infinite number of explore trials. Since the length of a trial is bounded by a finite number $T_{\max}$, the number of trials of some length $T \leq T_{\max}$ must be infinite. It is impossible to fit an infinite number of occupancy points at pairwise ($1$-norm) distances greater than $\rho$ within the compact occupancy space. Hence there must be two trials of length $T$ whose last-but-one occupancy states, $o_{T-1}$ and $o'_{T-1}$, satisfy $\|o_{T-1} - o'_{T-1}\|_1 \leq \rho$. Without loss of generality, assume that $o_{T-1}$ was visited first. According to Lemma 6.2, the point-based update in $o_{T-1}$ resulted in $\overline{V}_{T-1}(o'_{T-1}) - \underline{V}_{T-1}(o'_{T-1}) \leq \mathrm{thr}(T-1)$, which contradicts the fact that the trial-stopping condition of Algorithm 1 was not satisfied at $o'_{T-1}$ (and hence that the second trial was a trial of length $T$). ∎

Note that the number of trials could be (tediously) upper-bounded by determining how many balls of radius $\rho$ are required to cover the occupancy simplexes at each depth.

## 7 Discussion

Inspired by techniques solving POMDPs as belief MDPs or Dec-POMDPs as occupancy MDPs, we have demonstrated that zs-POSGs can be turned into a new type of sequential game, namely zs-OMGs, allowing us to apply Bellman’s optimality principle. Value function approximators (with heuristic initializations) can be used thanks to the Lipschitz continuity of the optimal value function $V^*$, and despite it possibly not being concave or convex in any relevant statistic. A variant of HSVI has been derived which provably converges in finite time to an $\epsilon$-optimal solution.

This approach was motivated by the fact that the corresponding techniques for POMDPs and Dec-POMDPs provide state-of-the-art solvers. The time complexity of the algorithm shall depend, among other things, on that of the maximin/minimax optimization technique in use, and on how many trials are required before convergence. We also currently lack empirical comparisons of the resulting algorithm with existing zs-POSG solution techniques.

Several implementation details could be further discussed, such as the maximin/minimax error-bounded optimization algorithm, the need to regularly prune dominated cones in $\overline{V}$ and $\underline{V}$, and the possible use of compression techniques to reduce the dimensionality of the occupancy subspaces, as in FB-HSVI [9].

Regarding execution, as in single-agent or collaborative multi-agent settings, while exploration is guided by optimistic decisions (greediness w.r.t. $\overline{V}$ for player 1 and $\underline{V}$ for player 2), actual decisions should be pessimistic, i.e., player 1 should act “greedily” w.r.t. $\underline{V}$, and player 2 w.r.t. $\overline{V}$.

Handling finite-horizon settings requires few changes. The maximum length of trials shall be the minimum between this horizon and the bound $T_{\max}$, which depends on $\epsilon$ and $\rho$. Additionally considering $\gamma = 1$ shall require revising the Lipschitz constants and some other formulas.

As often with Dec-POMDPs [30, 9], each player’s strategy is here history-dependent, because we could not come up with private belief states, something that is feasible under certain assumptions [15, 14]. One could possibly address this issue as MacDermed and Isbell [20] did, by assuming that a bounded number of beliefs is sufficient to solve the problem.

Public actions and observations, as in Poker, could be exploited by turning the non-observable sequential decision problem faced by the central planner into a partially observable one, and thus the deterministic OMG into a probabilistic one.

## References

• [1] N. Basilico, G. De Nittis, and N. Gatti. A security game combining patrolling and alarm-triggered responses under spatial and detection uncertainties. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
• [2] A. Basu and L. Stettner. Finite- and infinite-horizon Shapley games with nonsymmetric partial observation. SIAM Journal on Control and Optimization, 53(6):3584–3619, 2015.
• [3] R. Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Science, 38:716–719, 1952.
• [4] D. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
• [5] N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
• [6] O. Buffet, J. Dibangoye, A. Saffidine, and V. Thomas. Heuristic search value iteration for zero-sum stochastic games. IEEE Transactions on Games, 2020.
• [7] K. Chatterjee and L. Doyen. Partial-observation stochastic games: How to win when belief fails. ACM Transactions on Computational Logic, 15(2):16, 2014.
• [8] H. L. Cole and N. Kocherlakota. Dynamic games with hidden actions and hidden states. Journal of Economic Theory, 98(1):114–126, 2001.
• [9] J. Dibangoye, C. Amato, O. Buffet, and F. Charpillet. Optimally solving Dec-POMDPs as continuous-state MDPs. Journal of Artificial Intelligence Research, 55:443–497, 2016.
• [10] M. Fehr, O. Buffet, V. Thomas, and J. Dibangoye. ρ-POMDPs have Lipschitz-continuous ϵ-optimal value functions. In Advances in Neural Information Processing Systems 31, pages 6933–6943, 2018.
• [11] D. Fudenberg and J. Tirole. Game Theory. The MIT Press, 1991.
• [12] M. K. Ghosh, D. McDonald, and S. Sinha. Zero-sum stochastic games with partial information. Journal of Optimization Theory and Applications, 121(1):99–118, Apr. 2004.
• [13] E. A. Hansen, D. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, 2004.
• [14] K. Horák and B. Bošanský. Solving partially observable stochastic games with public observations. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, pages 2029–2036, 2019.
• [15] K. Horák, B. Bošanský, and M. Pěchouček. Heuristic search value iteration for one-sided partially observable stochastic games. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 558–564, 2017.
• [16] D. Koller and N. Megiddo. The complexity of two-person zero-sum games in extensive form. Games and Economic Behavior, 4(4):528–552, 1992.
• [17] D. Koller, N. Megiddo, and B. von Stengel. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(51):220–246, 1996.
• [18] H. W. Kuhn. Simplified two-person Poker. In H. W. Kuhn and A. W. Tucker, editors, Contributions to the Theory of Games, volume 1. Princeton University Press, 1950.
• [19] H. W. Kuhn. Extensive Games and the Problem of Information, volume Contributions to the Theory of Games II of Annals of Mathematics (AM-28), pages 193–216. Princeton University Press, 1953.
• [20] L. C. MacDermed and C. Isbell. Point based value iteration with optimal belief compression for Dec-POMDPs. In Advances in Neural Information Processing Systems 26, 2013.
• [21] E. Machuca. An analysis of multiobjective search algorithms and heuristics. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
• [22] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
• [23] R. Munos. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–130, 2014.
• [24] F. Oliehoek and N. Vlassis. Dec-POMDPs and extensive form games: equivalence of models and algorithms. Technical Report IAS-UVA-06-02, Intelligent Systems Laboratory Amsterdam, University of Amsterdam, 2006.
• [25] K. Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174 – 205, 1965.
• [26] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Science, 39(10):1095–1100, 1953.
• [27] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.
• [28] T. Smith. Probabilistic Planning for Robotic Exploration. PhD thesis, The Robotics Institute, Carnegie Mellon University, 2007.
• [29] B. S. Stewart and C. C. White, III. Multiobjective A*. Journal of the ACM, 38(4):775–814, Oct. 1991.
• [30] D. Szer, F. Charpillet, and S. Zilberstein. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 576–583, 2005.
• [31] B. von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(50):220–246, 1996.
• [32] A. Wiggers, F. Oliehoek, and D. Roijers. Structure in the value function of two-player zero-sum games of incomplete information. In Proceedings of the Twenty-second European Conference on Artificial Intelligence, pages 1628–1629, 2016.
• [33] A. Wiggers, F. Oliehoek, and D. Roijers. Structure in the value function of two-player zero-sum games of incomplete information. Computing Research Repository, abs/1606.06888, 2016.
• [34] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20, 2007.

## Appendix A Appendix

This appendix mainly provides proofs of several theoretical claims of the paper.

### a.1 From zs-POSGs to zs-OMGs

The following result shows that the occupancy state is Markovian, i.e., its value at time $\tau$ only depends on its previous value ($o_{\tau-1}$), the system dynamics ($P$), and the last behavioral decision rules ($\beta^1_{\tau-1}$ and $\beta^2_{\tau-1}$).

###### Lemma 1.

Given an occupancy state $o_{\tau-1}$ and a behavioral decision rule profile $(\beta^1_{\tau-1}, \beta^2_{\tau-1})$, the next occupancy state $o_\tau$ is given by the following formula (for any $s'$, $\theta^1_{\tau-1}$, $a^1$, $z^1$, $\theta^2_{\tau-1}$, $a^2$, $z^2$):

$$o_\tau\big(s',(\theta^1_{\tau-1},a^1,z^1),(\theta^2_{\tau-1},a^2,z^2)\big) \;=\; \beta^1_{\tau-1}(\theta^1_{\tau-1},a^1)\cdot\beta^2_{\tau-1}(\theta^2_{\tau-1},a^2)\cdot\sum_s P^{z^1,z^2}_{a^1,a^2}(s'|s)\cdot o_{\tau-1}(s,\theta^1_{\tau-1},\theta^2_{\tau-1}).$$
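Lemma 1’s update can be sketched with plain dictionaries; the encodings below (histories as tuples of (action, observation) pairs, `P[(a1, a2, z1, z2)]` as a table mapping $(s, s')$ to $\Pr(s', z^1, z^2 \mid s, a^1, a^2)$) and all names are ours, not the paper’s:

```python
from itertools import product

# Toy sketch of the occupancy-state update of Lemma 1:
#   o_tau(s', (th1,a1,z1), (th2,a2,z2)) =
#     beta1(th1,a1) * beta2(th2,a2) * sum_s P[a1,a2,z1,z2](s'|s) * o(s,th1,th2)
def next_occupancy(o, beta1, beta2, P, A1, A2, Z1, Z2, S):
    o_next = {}
    for (s, th1, th2), p in o.items():
        for a1, a2 in product(A1, A2):
            w = beta1[th1][a1] * beta2[th2][a2] * p
            if w == 0.0:
                continue
            for z1, z2, s2 in product(Z1, Z2, S):
                q = w * P[(a1, a2, z1, z2)].get((s, s2), 0.0)
                if q:
                    key = (s2, th1 + ((a1, z1),), th2 + ((a2, z2),))
                    o_next[key] = o_next.get(key, 0.0) + q
    return o_next
```

Since the dynamics $P$ and the decision rules are (conditional) probability distributions, the resulting occupancy state again sums to one, as the lemma requires.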
###### Proof.

The proof goes by simply developing the definition:

$$\begin{aligned}
o_\tau\big(s',(\theta^1_{\tau-1},a^1,z^1),(\theta^2_{\tau-1},a^2,z^2)\big)
&\stackrel{\text{def}}{=} \Pr\big(s',(\theta^1_{\tau-1},a^1,z^1),(\theta^2_{\tau-1},a^2,z^2)\big) \\
&= \sum_s \Pr\big(s,s',(\theta^1_{\tau-1},a^1,z^1),(\theta^2_{\tau-1},a^2,z^2)\big) \\
&= \sum_s \Pr(s',z^1,z^2 \mid s,\theta^1_{\tau-1},a^1,\theta^2_{\tau-1},a^2)\cdot \Pr(s,\theta^1_{\tau-1},a^1,\theta^2_{\tau-1},a^2) \\
&= \sum_s P^{z^1,z^2}_{a^1,a^2}(s'|s)\cdot \Pr(a^1,a^2 \mid s,\theta^1_{\tau-1},\theta^2_{\tau-1})\cdot \Pr(s,\theta^1_{\tau-1},\theta^2_{\tau-1}) \\
&= \sum_s P^{z^1,z^2}_{a^1,a^2}(s'|s)\cdot \Pr(a^1 \mid s,\theta^1_{\tau-1},\theta^2_{\tau-1})\cdot \Pr(a^2 \mid s,\theta^1_{\tau-1},\theta^2_{\tau-1})\cdot \Pr(s,\theta^1_{\tau-1},\theta^2_{\tau-1}) \\
&= \sum_s P^{z^1,z^2}_{a^1,a^2}(s'|s)\cdot \beta^1_{\tau-1}(\theta^1_{\tau-1},a^1)\cdot \beta^2_{\tau-1}(\theta^2_{\tau-1},a^2)\cdot o_{\tau-1}(s,\theta^1_{\tau-1},\theta^2_{\tau-1}) \\
&= \beta^1_{\tau-1}(\theta^1_{\tau-1},a^1)\cdot \beta^2_{\tau-1}(\theta^2_{\tau-1},a^2)\cdot \sum_s P^{z^1,z^2}_{a^1,a^2}(s'|s)\cdot o_{\tau-1}(s,\theta^1_{\tau-1},\theta^2_{\tau-1}).
\end{aligned}$$