 # Toward Solving 2-TBSG Efficiently

2-TBSG is a two-player game model in which one seeks Nash equilibria, and it is widely used in reinforcement learning and AI. Inspired by the fact that the simplex method for solving deterministic discounted Markov decision processes (MDPs) is strongly polynomial independent of the discount factor, we try to answer the open problem of whether there is a similar algorithm for 2-TBSG. We develop a simplex strategy iteration, in which one player updates its strategy with a simplex step while the other player finds an optimal counterstrategy in turn, and a modified simplex strategy iteration. Both belong to a class of geometrically converging algorithms. We establish the strongly polynomial property of these algorithms by considering a strategy combined from the current strategy and the equilibrium strategy. Moreover, we present a method to transform general 2-TBSGs into special 2-TBSGs where each state has exactly two actions.

## 1 Introduction

The Markov decision process (MDP) is a widely used model in machine learning and operations research, which establishes the basic rules of reinforcement learning. While solving an MDP focuses on maximizing (minimizing) the total reward (cost) for only one player, we consider a broader class of problems, the 2-player turn-based stochastic games (2-TBSG), which involve two players with opposite objectives. One player aims to maximize the total reward, and the other player aims to minimize it. MDP and 2-TBSG have many useful applications; see [8, 3, 12, 2, 5, 10, 16].

Similar to MDP, every 2-TBSG has its state set and action set, both of which are divided into two subsets for each player, respectively. Moreover, its transition probability matrix describes the transition distribution over the state set conditioned on the current action, and its reward function describes the immediate reward when taking the action.

We use a strategy (policy) to denote a mapping from the state set into the action set. In our setting, we focus on the discounted 2-TBSG, where the reward in later steps is multiplied by a discounted factor. Given strategies (policies) for both players, the total reward is defined to be the sum of all discounted rewards. We solve a 2-TBSG by finding its Nash equilibrium strategy (equilibrium strategy for short), where the first player cannot change its own strategy to obtain a larger total reward, and the second player cannot change its own strategy to obtain a smaller total reward. MDP can be viewed as a special case of 2-TBSG, where all states belong to the first player. In such cases, the equilibrium strategy agrees with the optimal policy of MDP.

MDPs have linear programming (LP) formulations. Hence algorithms for solving LP problems can be used to solve MDPs. One of the most commonly used algorithms for MDPs is the policy iteration algorithm, which can be viewed as a parallel counterpart of the simplex method for the corresponding LP. In paper , both the simplex method for the corresponding LP and the policy iteration algorithm have been proved to find the optimal policy in , where are the number of actions, the number of states and the discount factor, respectively. Later in , the bound for the policy iteration algorithm is improved by a factor to . In , this bound is improved to . When the MDP is deterministic (all transition probabilities are either 0 or 1), a strongly polynomial bound independent of the discount factor is proved in  for the simplex policy iteration method (each iteration changes only one action): for uniform discounted MDPs and for nonuniform discounted MDPs.

However, there is no simple LP formulation for 2-TBSGs. The strategy iteration algorithm , an analogue to the policy iteration, is a commonly used algorithm in finding the equilibrium strategy of 2-TBSGs. It is a strongly polynomial time algorithm first proved in  with a guarantee to find the equilibrium in iterations if the discounted factor is fixed. When the discounted factor is not fixed, an exponential lower bound is given for the policy iteration in MDP  and for the strategy iteration in 2-TBSG . It is an open problem whether there is a strongly polynomial algorithm whose complexity is independent of the discounted factor for 2-TBSG.

Motivated by the strongly polynomial simplex algorithm for solving MDPs, we present a simplex strategy iteration algorithm and a modified simplex strategy iteration algorithm for the 2-TBSG. In both algorithms each player updates in turn, where the second player always finds the best counterstrategy in its turn. In the simplex strategy iteration algorithm the first player updates its strategy using the simplex algorithm. In the modified simplex strategy iteration algorithm, the first player updates the action leading to the largest improvement after the second player finds the optimal counterstrategy. When the second player is trivial, the 2-TBSG becomes an MDP and the simplex strategy iteration algorithm can find its solution in strongly polynomial time independent of the discounted factor, which is a property not possessed by the strategy iteration algorithm in .

We also develop a proof technique to prove the strongly polynomial complexity for a class of geometrically converging algorithms. This class of algorithms includes the strategy iteration algorithm, the simplex strategy iteration algorithm, and the modified simplex strategy iteration algorithm. The complexity for the strategy iteration algorithm given in  can be recovered by our techniques. Our techniques use a combination of the current strategy and the equilibrium strategy. We establish a bound on the ratio between the difference in value from the current strategy to the equilibrium strategy, and the difference in value from the combined strategy to the equilibrium strategy. Using this bound and the geometrically converging property, we can prove that after a certain number of iterations, one action will disappear forever, which leads to strongly polynomial convergence when the discount factor is fixed. Although we have not fully answered the open problem, our algorithms and analysis point out a possible way of overcoming the difficulties.

Furthermore, a 2-TBSG where each state has exactly two actions can be transformed into a linear complementarity problem. An MDP where each state has exactly two actions can be solved by a combinatorial interior point method. In this paper we present a way to transform a general 2-TBSG into a 2-TBSG where each state has exactly two actions. The number of states in this constructed 2-TBSG is (we use to hide log factors of ). This result enables the application of both results in [9, 17] to general cases.

The rest of this paper is organized as follows. In Section 2 we present some basic concepts and lemmas of the 2-TBSG. In Section 3 we describe the simplex strategy iteration algorithm and the modified simplex strategy iteration algorithm. The proof of complexity of the class of geometrically converging algorithms is given in Section 4. The transformation from general 2-TBSGs into special 2-TBSGs is introduced in Section 5.

## 2 Preliminaries

In this section, we present some basic concepts of 2-TBSG. Our focus here is on the discounted 2-TBSG, defined as follows.

###### Definition 2.1.

A discounted 2-TBSG (2-TBSG for short) consists of a tuple , where . are the state set and the action set of each player, respectively. is the transition probability matrix, where denotes the probability of the event that the next state is conditioned on the current action . is the reward vector, where denotes the immediate reward received using action . For convenience, we use to denote the number of actions, and to denote the number of states.

Given a state in 2-TBSG setting, we use to denote the set of available actions corresponding to state . A deterministic strategy (strategy for short) is defined such that are mappings from to and from to , respectively. Moreover, each state matches to an action in .

For a given strategy , we define the transition probability matrix and reward function with respect to . The -th row of is chosen to be the row of action in , and the -th element of is chosen to be the reward of action . It is easy to observe that the matrix is a stochastic matrix. We next define the value vector and the modified reward function.

###### Definition 2.2.

The value vector of a given strategy is

$$v^{\pi_1,\pi_2} = v^{\pi} = (I - \gamma P_{\pi})^{-1} r_{\pi}.$$

###### Definition 2.3.

The modified reward function of a given strategy is defined as

$$r^{\pi} = r - (J - \gamma P)\, v^{\pi},$$

where is defined as

$$J_{ji} = \begin{cases} 1 & \text{if } j \in A_i, \\ 0 & \text{otherwise.} \end{cases}$$
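To make Definitions 2.2 and 2.3 concrete, the following minimal Python sketch evaluates the value vector and the modified reward on a small made-up game. The game data (`ACTIONS`, `GAMMA`), the helper `solve`, and the function names are assumptions introduced for illustration only and do not come from the paper. A strategy is encoded as a list giving the chosen action index for each state.

```python
GAMMA = 0.5  # hypothetical discount factor

# Hypothetical toy game with 2 states and 4 actions.
# Each action is (state it belongs to, transition row over states, reward).
ACTIONS = [
    (0, [1.0, 0.0], 1.0),  # a0: stay in state 0
    (0, [0.0, 1.0], 0.5),  # a1: move to state 1
    (1, [0.0, 1.0], 2.0),  # a2: stay in state 1
    (1, [0.5, 0.5], 0.0),  # a3: move to a uniformly random state
]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(n):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def value(strategy):
    """Value vector v^pi = (I - gamma P_pi)^{-1} r_pi."""
    n = len(strategy)
    A = [
        [(1.0 if i == j else 0.0) - GAMMA * ACTIONS[strategy[i]][1][j]
         for j in range(n)]
        for i in range(n)
    ]
    r_pi = [ACTIONS[strategy[i]][2] for i in range(n)]
    return solve(A, r_pi)


def modified_reward(strategy):
    """Modified reward r^pi = r - (J - gamma P) v^pi, one entry per action."""
    v = value(strategy)
    return [
        rew - (v[s] - GAMMA * sum(p * x for p, x in zip(row, v)))
        for s, row, rew in ACTIONS
    ]


v = value([0, 2])             # strategy that plays a0 in state 0, a2 in state 1
rc = modified_reward([0, 2])
```

In this sketch the entries of `rc` belonging to the actions used by the strategy vanish, which follows directly from the two definitions, while the remaining entries measure how attractive a switch to another action would be.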

Furthermore, for a given 2-TBSG, the optimal counterstrategy against another player’s given strategy is defined in Definition 2.4. The equilibrium strategy is given in Definition 2.5.

###### Definition 2.4.

For player 2’s strategy $\pi_2$, player 1’s strategy $\pi_1$ is the optimal counterstrategy against $\pi_2$ if and only if for any strategy $\pi_1'$ of player 1, we have

$$v^{\pi_1,\pi_2} \ge v^{\pi_1',\pi_2}.$$

Player 2’s optimal counterstrategy can be defined similarly: is the optimal counterstrategy against if and only if for any strategy , . Here for two value vectors , we say () if and only if () for .

###### Definition 2.5.

A strategy is called an equilibrium strategy, if and only if is the optimal counterstrategy against , and is the optimal counterstrategy against .

To describe the property of equilibrium strategies, we present Theorems 2.6 and 2.7 given in [7, 15]. Theorem 2.6 indicates the existence of an equilibrium strategy.

###### Theorem 2.6.

Every 2-TBSG has at least one equilibrium strategy. If and are two equilibrium strategies, then . Furthermore, for any player 1’s strategy (or player 2’s strategy ), there always exists a player 2’s optimal counterstrategy against (player 1’s optimal counterstrategy against ), and for any two optimal counterstrategies (), we have ().

The next theorem points out a useful depiction of the value function at the equilibrium.

###### Theorem 2.7.

Let be an equilibrium strategy for 2-TBSG. If is a strategy of player 1, and is player 2’s optimal counterstrategy against , then we have . The equality holds if and only if is an equilibrium strategy.

We now define the flux vector of a given strategy .

###### Definition 2.8.

The flux of a given strategy is defined as

$$(x^{\pi})_{\pi} = (I - \gamma P_{\pi})^{-T}\mathbf{1}, \qquad (x^{\pi})_a = 0,\ \forall a \in A,\ a \notin \pi.$$

Our next lemma presents bounds and conditions of the flux vector, and the relationship among the value function, the flux vector and reduced costs. This lemma and the following several lemmas can be found in . To make the paper self-contained, we briefly give their proofs.

###### Lemma 2.9.

For any strategy , we have

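1. $\mathbf{1}^T x^{\pi} = \frac{l}{1-\gamma}$;

2. for any $a \in \pi$, $1 \le x^{\pi}_a \le \frac{l}{1-\gamma}$;

3. $\mathbf{1}^T v^{\pi} = (x^{\pi})^T r$;

4. $v^{\pi'} - v^{\pi} = (I - \gamma P_{\pi'})^{-1} (r^{\pi})_{\pi'}$ for any strategy $\pi'$, and moreover, $\mathbf{1}^T (v^{\pi'} - v^{\pi}) = (x^{\pi'})^T r^{\pi}$.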

###### Proof.

Item (1) is proved by

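$$\mathbf{1}^T x^{\pi} = \mathbf{1}^T (I - \gamma P_{\pi})^{-T}\mathbf{1} = \left[(I - \gamma P_{\pi})^{-1}\mathbf{1}\right]^T \mathbf{1} = \frac{1}{1-\gamma}\,\mathbf{1}^T\mathbf{1} = \frac{l}{1-\gamma}.$$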

Item (2) is due to

$$(x^{\pi})_{\pi} - \mathbf{1} = \gamma P_{\pi}^T (I - \gamma P_{\pi})^{-T}\mathbf{1} \ge 0.$$

This indicates that , . Hence we have and from item (1). Finally the last two items are obtained from

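$$\mathbf{1}^T v^{\pi} = \mathbf{1}^T (I - \gamma P_{\pi})^{-1} r_{\pi} = r_{\pi}^T (I - \gamma P_{\pi})^{-T}\mathbf{1} = r_{\pi}^T (x^{\pi})_{\pi} = (x^{\pi})^T r,$$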

and

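$$\begin{aligned}
v^{\pi'} - v^{\pi} &= (I - \gamma P_{\pi'})^{-1} r_{\pi'} - (I - \gamma P_{\pi'})^{-1}(I - \gamma P_{\pi'})\, v^{\pi} \\
&= (I - \gamma P_{\pi'})^{-1}\left[r - (J - \gamma P) v^{\pi}\right]_{\pi'} = (I - \gamma P_{\pi'})^{-1} (r^{\pi})_{\pi'}; \\
\mathbf{1}^T (v^{\pi'} - v^{\pi}) &= \mathbf{1}^T (I - \gamma P_{\pi'})^{-1} (r^{\pi})_{\pi'} = (x^{\pi'})_{\pi'}^T (r^{\pi})_{\pi'} = (x^{\pi'})^T r^{\pi}.
\end{aligned}$$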
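As a quick sanity check of the identities used in this proof (an illustrative special case, not part of the original argument), suppose that every action chosen by $\pi$ keeps the game in its current state, so that $P_{\pi} = I$, and let $l$ denote the number of states as in the computation for item (1). Then

$$(x^{\pi})_{\pi} = (I - \gamma I)^{-T}\mathbf{1} = \frac{1}{1-\gamma}\mathbf{1}, \qquad \mathbf{1}^T x^{\pi} = \frac{l}{1-\gamma}, \qquad v^{\pi} = \frac{1}{1-\gamma}\, r_{\pi}, \qquad \mathbf{1}^T v^{\pi} = \frac{1}{1-\gamma}\,\mathbf{1}^T r_{\pi} = (x^{\pi})^T r.$$

Every nonzero entry of the flux equals $\frac{1}{1-\gamma}$, which indeed lies between $1$ and $\frac{l}{1-\gamma}$, and the total value is the reward of the chosen actions weighted by the flux.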

In the following, we present a lemma indicating the positiveness or negativeness of the reduced costs of optimal counterstrategies and equilibrium strategies.

###### Lemma 2.10.
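1. A strategy $\pi_1$ for player 1 is an optimal counterstrategy against player 2’s strategy $\pi_2$ if and only if $(r^{\pi_1,\pi_2})_{A_1} \le 0$.

2. A strategy $\pi_2$ for player 2 is an optimal counterstrategy against player 1’s strategy $\pi_1$ if and only if $(r^{\pi_1,\pi_2})_{A_2} \ge 0$.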

3. A strategy is an equilibrium strategy if and only if it satisfies:

$$(r^{\pi_1,\pi_2})_{A_1} \le 0, \qquad (r^{\pi_1,\pi_2})_{A_2} \ge 0.$$

###### Proof.

If satisfies , then for any player 1’s strategy , we have

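$$v^{\pi_1',\pi_2} - v^{\pi_1,\pi_2} = (I - \gamma P_{\pi_1',\pi_2})^{-1} (r^{\pi_1,\pi_2})_{\pi_1',\pi_2} = \sum_{n=0}^{\infty} \gamma^n P_{\pi_1',\pi_2}^n (r^{\pi_1,\pi_2})_{\pi_1',\pi_2} \le 0,$$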

where the last inequality follows from for and .

Suppose that player 1’s strategy is the optimal counterstrategy against player 2’s strategy . For any , and , we let

$$\pi_1'(s_1) = \begin{cases} a' & \text{if } s_1 = s; \\ \pi_1(s_1) & \text{else.} \end{cases}$$

Then again from Lemma 2.9 (4), we have

$$x^{\pi_1',\pi_2}_{a'}\, r^{\pi}_{a'} = \mathbf{1}^T (v^{\pi_1',\pi_2} - v^{\pi_1,\pi_2}) \le 0,$$

where the inequality comes from the definition of equilibrium strategies. Since , we have , which indicates that . With this estimation and for , we have proved that . Hence, item (1) is established, and the proof of item (2) is similar. Finally item (3) follows from items (1) and (2) directly. ∎

## 3 Geometrically Converging Algorithms

Inspired by the simplex method solving the LP corresponding to the MDP and the strategy iteration algorithm given in , we propose a simplex strategy iteration (Algorithm 1) and a modified simplex strategy iteration algorithm (Algorithm 2) for 2-TBSG.

The simplex strategy iteration algorithm can be viewed as a generalization of the strongly polynomial simplex algorithm for solving MDPs. In our algorithm, both players update their strategies in turn. In each iteration, the first player updates its strategy using the simplex method, which means updating only the action with the largest reduced cost, while the second player updates its strategy to the optimal counterstrategy. When the second player has only one possible action and the transition matrix is deterministic, the 2-TBSG reduces to a deterministic MDP. Then the simplex strategy iteration algorithm can find an equilibrium (optimal) strategy in strongly polynomial time independent of the discount factor, a property that has not been proven for the strategy iteration.

The modified simplex strategy iteration algorithm can be viewed as a modification of the simplex strategy iteration algorithm. In this algorithm, both players also update their strategies in turn, and the second player always finds the optimal counterstrategy in its moves. However, in each of the first player’s moves, only one action is updated: the one that leads to the biggest improvement in the value function when the second player uses the optimal counterstrategy.

It is easy to see that every iteration of the simplex strategy iteration algorithm involves one simplex update and the solution of one MDP, while every iteration of the modified simplex strategy iteration algorithm involves the solution of multiple MDPs. Hence every iteration of both algorithms can be carried out in strongly polynomial time when the discount factor is fixed.
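The simplex strategy iteration described above can be sketched in code as follows. This is a hedged illustration rather than the paper's Algorithm 1: the toy game, all function names, the brute-force search for player 2's optimal counterstrategy (the text solves an MDP instead), and the fixed-point evaluation of the value vector are assumptions made for brevity. Player 1 (the maximizer) switches only the action with the largest reduced cost, and player 2 (the minimizer) then replies optimally.

```python
from itertools import product

GAMMA = 0.5
OWNER = [1, 2]  # OWNER[s]: player controlling state s (hypothetical toy game)
# Each action: (state it belongs to, transition row over states, reward).
ACTIONS = [
    (0, [1.0, 0.0], 1.0),  # a0: player 1 stays in state 0
    (0, [0.0, 1.0], 0.0),  # a1: player 1 moves to state 1
    (1, [0.0, 1.0], 4.0),  # a2: player 2 stays in state 1
    (1, [1.0, 0.0], 0.0),  # a3: player 2 moves to state 0
]


def value(strategy, sweeps=200):
    """Approximate v^pi by iterating v <- r_pi + gamma * P_pi v."""
    v = [0.0] * len(strategy)
    for _ in range(sweeps):
        v = [ACTIONS[a][2] + GAMMA * sum(p * x for p, x in zip(ACTIONS[a][1], v))
             for a in strategy]
    return v


def reduced_costs(strategy):
    """Modified reward r^pi = r - (J - gamma P) v^pi, one entry per action."""
    v = value(strategy)
    return [rew - (v[s] - GAMMA * sum(p * x for p, x in zip(row, v)))
            for s, row, rew in ACTIONS]


def best_response_p2(strategy):
    """Brute-force optimal counterstrategy for player 2 (the minimizer)."""
    p2_states = [s for s, owner in enumerate(OWNER) if owner == 2]
    options = [[a for a, act in enumerate(ACTIONS) if act[0] == s]
               for s in p2_states]
    best = None
    for choice in product(*options):
        cand = list(strategy)
        for s, a in zip(p2_states, choice):
            cand[s] = a
        if best is None or sum(value(cand)) < sum(value(best)):
            best = cand
    return best


def simplex_strategy_iteration(strategy, max_iter=100):
    """Alternate a one-action simplex step for player 1 with a best reply."""
    pi = best_response_p2(strategy)
    p1_actions = [a for a, act in enumerate(ACTIONS) if OWNER[act[0]] == 1]
    for _ in range(max_iter):
        rc = reduced_costs(pi)
        a_best = max(p1_actions, key=lambda a: rc[a])
        if rc[a_best] <= 1e-9:  # no improving action for player 1: stop
            break
        pi[ACTIONS[a_best][0]] = a_best  # simplex step: change one action
        pi = best_response_p2(pi)
    return pi


equilibrium = simplex_strategy_iteration([1, 2])
```

The loop stops once no action of player 1 has a positive reduced cost. Since player 2 has just played an optimal counterstrategy, the sign conditions of Lemma 2.10 (3) then hold for the returned strategy. Enumerating player 2's strategies is exponential in general and is used here only to keep the sketch short.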

Next we present a class of geometrically converging algorithms used for proving the strongly polynomial complexity for several algorithms in the next section.

###### Definition 3.1.

We say a strategy-update algorithm (an algorithm which updates the strategies of both players in each iteration) is a geometrically converging algorithm with a parameter $M$, if it updates a strategy to such that the following properties hold.

• is the optimal counterstrategy against ;

• ;

• If , then is an equilibrium strategy;

• The updates of this algorithm satisfy

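$$\mathbf{1}^T \left(v^{\pi_1^*,\pi_2^*} - v^{\pi_1^{n+M},\pi_2^{n+M}}\right) \le \frac{(1-\gamma)^2}{n^2} \cdot \mathbf{1}^T \left(v^{\pi_1^*,\pi_2^*} - v^{\pi_1^{n},\pi_2^{n}}\right).$$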

To begin with, we exhibit a lemma indicating the geometrically converging property of the value function in the simplex strategy iteration algorithm.

###### Lemma 3.2.

Suppose the sequence of strategies generated by the simplex strategy iteration algorithm is . Then the following inequality holds

$$\mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+1}}) \le \left(1 - \frac{1-\gamma}{l}\right) \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}). \tag{1}$$

###### Proof.

According to Algorithm 1, we have

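$$\begin{aligned}
\mathbf{1}^T (v^{\pi^{n+1}} - v^{\pi^{n}}) &\ge r^{\pi^n}_{a_1} x^{\pi^{n+1}}_{a_1} \ge r^{\pi^n}_{a_1} \ge \frac{1-\gamma}{l} \sum_{a \in A_1} r^{\pi^n}_{a} x^{\pi^{n+1}}_{a} \\
&\ge \frac{1-\gamma}{l} \sum_{a \in A} r^{\pi^n}_{a} x^{\pi^{n+1}}_{a} = \frac{1-\gamma}{l}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}),
\end{aligned}$$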

where the second and third inequalities follow from Lemma 2.9 (2) and the choice of , the fourth inequality follows from Lemma 2.10, and the first inequality and last equation are due to Lemma 2.9 (4) and Lemma 2.10. ∎

Using this lemma, we show in the next proposition that the strategy iteration algorithm, Algorithm 1 and Algorithm 2 all belong to the class of geometrically converging algorithms.

###### Proposition 3.3.
1. The strategy iteration algorithm given in  is a geometrically converging algorithm with parameter ;

2. The simplex strategy iteration algorithm (Algorithm 1) is a geometrically converging algorithm with parameter ;

3. The modified simplex strategy iteration algorithm (Algorithm 2) is a geometrically converging algorithm with parameter ;

###### Proof.

It is easy to verify that the three algorithms described above satisfy the first three conditions in the definition of geometrically converging algorithms. Next, we prove that all of these algorithms satisfy the last condition. For the strategy iteration algorithm, according to Lemma 4.8 and Lemma 5.4 given in , we have

$$\mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+1}}) \le \gamma\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}).$$

Hence if ( is a constant), then we obtain

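$$\begin{aligned}
\mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+M}}) &\le \gamma^{M}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}) \le \gamma^{-2\log_{\gamma}\frac{n}{1-\gamma}}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}) \\
&= \frac{(1-\gamma)^2}{l^2}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}),
\end{aligned}$$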

and the last condition of geometrically converging algorithms is verified.

For the simplex strategy iteration algorithm, if we choose ( is a constant), then according to inequality (1) we have

$$\mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+M}}) \le \frac{(1-\gamma)^2}{l^2}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}),$$

and the last condition of geometrically converging algorithms is verified.

Finally we consider the modified simplex strategy iteration algorithm. For , let , where is an action of state . Let

$$\pi_1'(s) = \begin{cases} a_1, & \text{if } s = s_1, \\ \pi^{n}(s), & \text{otherwise,} \end{cases}$$

be player 2’s optimal counterstrategy against , and . Then from inequality (1), we have

$$\mathbf{1}^T (v^{\pi^*} - v^{\pi'}) \le \left(1 - \frac{1-\gamma}{n}\right) \mathbf{1}^T (v^{*} - v^{\pi^{n}}).$$

According to Algorithm 2, we have

$$\mathbf{1}^T v^{\pi^{n+1}} \ge \mathbf{1}^T v^{\pi'},$$

which leads to the following estimation:

$$\mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+1}}) \le \left(1 - \frac{1-\gamma}{n}\right) \mathbf{1}^T (v^{*} - v^{\pi^{n}}).$$

Therefore, similar to the previous case we can choose such that

$$\mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+M}}) \le \frac{(1-\gamma)^2}{l^2}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}),$$

and the last condition of geometrically converging algorithms is verified. ∎
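To get a feeling for the size of the parameter $M$ in the proof above, the following small Python sketch computes the smallest number of iterations after which a per-iteration contraction rate brings the gap below the factor $(1-\gamma)^2$ divided by the squared number of states. The function name, the argument names and the sample numbers are assumptions chosen for illustration; the two rates are the ones appearing in the displayed inequalities ($\gamma$ for the strategy iteration and $1$ minus $(1-\gamma)$ over the number of states for the simplex variants).

```python
import math


def iterations_needed(rate, num_states, gamma):
    """Smallest integer M >= 1 with rate**M <= (1 - gamma)**2 / num_states**2."""
    target = (1.0 - gamma) ** 2 / num_states ** 2
    return max(1, math.ceil(math.log(target) / math.log(rate)))


# Hypothetical sample values.
gamma, num_states = 0.9, 10
target = (1.0 - gamma) ** 2 / num_states ** 2

# Strategy iteration: the gap contracts by gamma per iteration.
m_strategy_iteration = iterations_needed(gamma, num_states, gamma)

# Simplex strategy iteration: contraction rate from inequality (1).
simplex_rate = 1.0 - (1.0 - gamma) / num_states
m_simplex = iterations_needed(simplex_rate, num_states, gamma)
```

The simplex variants change only one action of player 1 per iteration, so their contraction rate is closer to one and the resulting $M$ is larger than for the strategy iteration.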

## 4 Strongly Polynomial Complexity of Geometrically Converging Algorithms

In this section, we develop the strongly polynomial property of geometrically converging algorithms if the parameter is viewed as a constant. Slightly different from the proof in  for the strategy at the -th iteration, we present a proof by considering the strategy , where is an equilibrium strategy. We show that can be both upper and lower bounded by some proportion of . By applying the property of geometrically converging algorithms, we obtain that after a certain number of iterations, one of player 1’s actions will disappear from forever.

###### Theorem 4.1.

Any geometrically converging algorithm with a parameter finds the equilibrium strategy in

 O(Mm)

number of iterations.

###### Proof.

Suppose is the sequence generated by a geometrically converging algorithm. We define , where is one of the equilibrium strategies.

According to Lemma 2.10 and the fact that is the optimal counterstrategy against , and the definition of geometrically converging algorithm, we have

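$$\mathbf{1}^T (v^{\pi^{n+1}} - v^{\pi^{n}}) = \sum_{a \in \pi_1^{n+1}} x^{\pi^{n+1}}_{a} r^{\pi^{n}}_{a} + \sum_{a \in \pi_2^{n+1}} x^{\pi^{n+1}}_{a} r^{\pi^{n}}_{a} \ge \sum_{a \in \pi_1^{n+1}} x^{\pi^{n+1}}_{a} r^{\pi^{n}}_{a} \ge 0,$$

$$\mathbf{1}^T v^{\pi^{n}} \le \mathbf{1}^T v^{\pi^{n+1}}. \tag{2}$$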

According to Lemma 2.10, we have

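$$\begin{aligned}
\mathbf{1}^T (v^{\pi^*} - v^{\eta^n}) &= \mathbf{1}^T (v^{\pi_1^*,\pi_2^*} - v^{\pi_1^n,\pi_2^*}) = -(x^{\pi_1^n,\pi_2^*})^T r^{\pi_1^*,\pi_2^*} = -(x^{\pi_1^n,\pi_2^*})_{A_1}^T\, r^{\pi_1^*,\pi_2^*}_{A_1} \ge 0, \\
\mathbf{1}^T (v^{\eta^n} - v^{\pi^n}) &= \mathbf{1}^T (v^{\pi_1^n,\pi_2^*} - v^{\pi_1^n,\pi_2^n}) = (x^{\pi_1^n,\pi_2^*})^T r^{\pi_1^n,\pi_2^n} = (x^{\pi_1^n,\pi_2^*})_{A_2}^T\, r^{\pi_1^n,\pi_2^n}_{A_2} \ge 0,
\end{aligned}$$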

which implies

$$\mathbf{1}^T v^{\pi^{n}} \le \mathbf{1}^T v^{\eta^{n}} \le \mathbf{1}^T v^{\pi^*}. \tag{3}$$

We next prove the following inequality:

$$\mathbf{1}^T (v^{\pi^*} - v^{\eta^{n}}) \ge \frac{1-\gamma}{n} \cdot \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}). \tag{4}$$

A direct calculation gives

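$$\begin{aligned}
\mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}) &= \mathbf{1}^T (v^{\pi_1^*,\pi_2^*} - v^{\pi_1^n,\pi_2^n}) = -(x^{\pi_1^n,\pi_2^n})^T r^{\pi_1^*,\pi_2^*} \\
&= -\sum_{a \in \pi_1^n} x^{\pi_1^n,\pi_2^n}_{a} r^{\pi_1^*,\pi_2^*}_{a} - \sum_{a \in \pi_2^n} x^{\pi_1^n,\pi_2^n}_{a} r^{\pi_1^*,\pi_2^*}_{a} \le -\sum_{a \in \pi_1^n} x^{\pi_1^n,\pi_2^n}_{a} r^{\pi_1^*,\pi_2^*}_{a},
\end{aligned}$$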

where the last inequality is obtained from Lemma 2.10. Then noticing that

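$$1 \le x^{\pi_1^n,\pi_2^n}_{a},\ x^{\pi_1^n,\pi_2^*}_{a} \le \frac{l}{1-\gamma}, \qquad r^{\pi_1^*,\pi_2^*}_{a} \le 0, \qquad \forall a \in \pi_1^n,$$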

we have

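$$\begin{aligned}
\mathbf{1}^T (v^{\pi^*} - v^{\eta^{n}}) &= \mathbf{1}^T (v^{\pi_1^*,\pi_2^*} - v^{\pi_1^n,\pi_2^*}) = -(x^{\pi_1^n,\pi_2^*})^T r^{\pi_1^*,\pi_2^*} = -\sum_{a \in \pi_1^n} x^{\pi_1^n,\pi_2^*}_{a} r^{\pi_1^*,\pi_2^*}_{a} \\
&\ge -\frac{1-\gamma}{l} \sum_{a \in \pi_1^n} x^{\pi_1^n,\pi_2^n}_{a} r^{\pi_1^*,\pi_2^*}_{a} \ge \frac{1-\gamma}{l}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}).
\end{aligned}$$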

Then the inequality (4) is proved.

Finally, we prove that for any , either there exists an action in that will never belong to when , or we have

$$\mathbf{1}^T (v^{\pi^{n+M+1}} - v^{\pi^{n+M}}) = 0.$$

Actually for any , suppose , we obtain

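$$\mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+p}}) < \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+M}}) \le \frac{(1-\gamma)^2}{l^2}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}})$$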

from (2) and the definition of geometrically converging algorithm. Hence according to (3) and (4), we get

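$$\mathbf{1}^T (v^{\pi^*} - v^{\eta^{n+p}}) \le \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n+p}}) < \frac{(1-\gamma)^2}{l^2}\, \mathbf{1}^T (v^{\pi^*} - v^{\pi^{n}}) \le \frac{1-\gamma}{l}\, \mathbf{1}^T (v^{\pi^*} - v^{\eta^{n}}). \tag{5}$$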

Therefore, choosing , and because for any , according to Lemma 2.10, we obtain

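$$\mathbf{1}^T (v^{\pi^*} - v^{\eta^{n}}) = -\sum_{a \in \pi_1^n} x^{\eta^n}_{a} r^{\pi^*}_{a} \le \left(\sum_{a \in \pi_1^n} x^{\eta^n}_{a}\right) \cdot \left(-r^{\pi^*}_{a_1}\right) \le -\frac{l}{1-\gamma} \cdot r^{\pi^*}_{a_1}$$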

from Lemma 2.9. If , we have

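$$\mathbf{1}^T (v^{\pi^*} - v^{\eta^{n+p}}) = -\sum_{a \in \pi_1^{n+p}} x^{\eta^{n+p}}_{a} r^{\pi^*}_{a} \ge -x^{\eta^{n+p}}_{a_1} r^{\pi^*}_{a_1} \ge -r^{\pi^*}_{a_1},$$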

where the first inequality is due to Lemma 2.10 and the second inequality is due to Lemma 2.9. Therefore, combining these two inequalities and the inequality (5) and noticing that , we get

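$$-r^{\pi^*}_{a_1} \le \mathbf{1}^T (v^{\pi^*} - v^{\eta^{n+p}}) < \frac{1-\gamma}{l}\, \mathbf{1}^T (v^{\pi^*} - v^{\eta^{n}}) \le -\frac{1-\gamma}{l} \cdot \frac{l}{1-\gamma} \cdot r^{\pi^*}_{a_1} = -r^{\pi^*}_{a_1}.$$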

This is a contradiction. The previous derivation means that if does not hold for , then an action of must disappear after forever. Hence, after every $M$ iterations an action will disappear forever. This process cannot happen more than times (since there are actions and every strategy has actions), which indicates that for some ,

$$\mathbf{1}^T (v^{\pi^{n+M+1}} - v^{\pi^{n+M}}) = 0.$$

It follows from the definition of geometrically converging algorithm that is the equilibrium strategy. This indicates that within

 O(mM)

number of iterations, we can find one of the equilibrium strategies. ∎

Our next theorem presents the complexity of the strategy iteration algorithm, the simplex strategy iteration algorithm and the modified simplex strategy iteration algorithm.

###### Theorem 4.2.

The following algorithms have strongly polynomial convergence when the discount factor is fixed.

• The strategy iteration algorithm given in  can find the equilibrium strategy within iterations;

• The simplex strategy iteration algorithm (Algorithm 1) can find the equilibrium strategy within